Posted to user@spark.apache.org by Pedro Tuero <tu...@gmail.com> on 2019/02/01 16:11:49 UTC

Re: Aws

Hi Hiroyuki, thanks for the answer.

I found a solution for the cores-per-executor configuration:
I set this property to true:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
It was probably true by default in version 5.16, but I couldn't find when
that changed.
The same page says that dynamic allocation is true by default. I thought
that would do the trick, but reading it again, I think it relates to the
number of executors rather than the number of cores.
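
For reference, this is roughly how that property can be attached when
launching the cluster from the Java API (AWS SDK for Java v1). A minimal
sketch only: the names and instance details are illustrative, and a real
request also needs service/instance roles:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.*;

    public class LaunchCluster {
        public static void main(String[] args) {
            AmazonElasticMapReduce emr =
                    AmazonElasticMapReduceClientBuilder.defaultClient();

            // Same as the {"Classification": "spark", "Properties":
            // {"maximizeResourceAllocation": "true"}} JSON in the AWS docs.
            Configuration sparkMax = new Configuration()
                    .withClassification("spark")
                    .addPropertiesEntry("maximizeResourceAllocation", "true");

            RunJobFlowRequest request = new RunJobFlowRequest()
                    .withName("example-cluster")          // hypothetical name
                    .withReleaseLabel("emr-5.20.0")
                    .withApplications(new Application().withName("Spark"))
                    .withConfigurations(sparkMax)
                    .withInstances(new JobFlowInstancesConfig()
                            .withMasterInstanceType("r5.xlarge")
                            .withSlaveInstanceType("r5.xlarge")
                            .withInstanceCount(6));

            emr.runJobFlow(request);
        }
    }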

But the jobs are still taking longer than before.
Looking at the application history, I see these differences:
For the same job, the same instance types, and the default (AWS-managed)
configuration for executors, cores, and memory:
Instances:
6 r5.xlarge: 4 vCPUs, 32 GB of memory each (so there are 24 cores: 6
instances * 4 cores).

With 5.16:
- 24 executors (4 on each instance, including the one that also hosted the
driver).
- 4 cores each.
- 2.7 GB * 2 (storage + on-heap storage) memory each.
- 1 executor per core, yet at the same time 4 cores per executor (?).
- Total executor memory per instance: 21.6 GB (2.7 * 2 * 4).
- Total elapsed time: 6 minutes.
With 5.20:
- 5 executors (1 on each instance, 0 on the instance with the driver).
- 4 cores each.
- 11.9 GB * 2 (storage + on-heap storage) memory each.
- Total executor memory per instance: 23.8 GB (11.9 * 2 * 1).
- Total elapsed time: 8 minutes.


I don't understand the 5.16 configuration, but it works better.
It seems that in 5.20 a full instance is wasted on the driver alone, when
it could also host an executor.
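
In case it helps someone else: the sizing can also be pinned explicitly
instead of relying on EMR's defaults. A minimal sketch, reusing the numbers
from my first mail (8-core / 40 GB instances with 10 GB per executor); the
property names are standard Spark properties, the values are illustrative:

    import org.apache.spark.sql.SparkSession;

    public class ExplicitSizing {
        public static void main(String[] args) {
            // Request 2 cores and 10 GB per executor explicitly, the same
            // layout EMR 5.16 used to pick automatically in that example.
            SparkSession spark = SparkSession.builder()
                    .appName("example-job")              // hypothetical name
                    .config("spark.executor.cores", "2")
                    .config("spark.executor.memory", "10g")
                    .getOrCreate();
        }
    }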


Regards,
Pedro.



On Thu, Jan 31, 2019 at 20:16, Hiroyuki Nagata <id...@gmail.com>
wrote:

> Hi, Pedro
>
>
> I have also started using AWS EMR, with Spark 2.4.0. I'm looking for
> performance-tuning methods.
>
> Do you have dynamic allocation configured?
>
> FYI:
>
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> I haven't tested it yet. I guess spark-submit needs to specify the
> number of executors.
>
> Regards,
> Hiroyuki
>
> On Fri, Feb 1, 2019 at 5:23, Pedro Tuero (tueropedro@gmail.com) wrote:
>
>> Hi guys,
>> I usually run Spark jobs on AWS EMR.
>> I recently switched from EMR release label 5.16 to 5.20 (which uses
>> Spark 2.4.0).
>> I've noticed that a lot of steps are taking longer than before.
>> I think it is related to the automatic configuration of cores per
>> executor.
>> In version 5.16, some executors took more cores if the instance allowed
>> it.
>> Say an instance had 8 cores and 40 GB of RAM, and the RAM configured
>> per executor was 10 GB; then EMR automatically assigned 2 cores per
>> executor.
>> Now in label 5.20, unless I configure the number of cores manually,
>> only one core is assigned per executor.
>>
>> I don't know if it is related to Spark 2.4.0 or if it is something
>> managed by AWS...
>> Does anyone know if there is a way to automatically use more cores when
>> it is physically possible?
>>
>> Thanks,
>> Peter.
>>
>

Re: Aws

Posted by Pedro Tuero <tu...@gmail.com>.
Hi Noritaka,

I start clusters from the Java API.
Clusters running 5.16 have no manual configurations in the EMR console
Configuration tab, so I assume the value of this property is the default
on 5.16.
I enabled maximize resource allocation because otherwise the number of
cores assigned automatically (without setting spark.executor.cores
manually) was always one per executor.

I already use the same configurations: I ran the same job with the same
scripts, configuration files, and input data, changing only the binaries
with my own code, which includes launching the clusters with the EMR 5.20
release label.

Anyway, setting maximize resource allocation seems to have helped enough
with the core distribution.
Some jobs now take even less time than before.
Now I'm stuck analyzing a case where the number of tasks created seems to
be the problem. I recently posted another thread about that on this list.

Regards,
Pedro



Re: Aws

Posted by Noritaka Sekiyama <mo...@gmail.com>.
Hi Pedro,

It seems that you disabled maximize resource allocation in 5.16 but
enabled it in 5.20.
This config can differ depending on how you start the EMR cluster (via the
quick wizard, the advanced wizard in the console, or the CLI/API).
You can see it in the EMR console Configuration tab.

Please compare the Spark properties (especially spark.executor.cores,
spark.executor.memory, spark.dynamicAllocation.enabled, etc.) between your
two Spark clusters with their different EMR versions.
You can see them in the Spark web UI's Environment tab or in the log files.
Then please try with the same properties against the same dataset and the
same deploy mode (cluster or client).
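
If it is easier to diff from the logs, the resolved properties can also be
printed from inside the app. A small sketch (standard Spark API, assuming
an existing SparkSession named "spark"):

    import org.apache.spark.SparkConf;
    import scala.Tuple2;

    // Print every resolved Spark property: the same information shown in
    // the web UI's Environment tab, but available in the driver logs.
    SparkConf conf = spark.sparkContext().getConf();
    for (Tuple2<String, String> kv : conf.getAll()) {
        System.out.println(kv._1() + "=" + kv._2());
    }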

Even on EMR, you can configure the number of cores and the memory of the
driver/executors in config files, via spark-submit arguments, and inside
the Spark app if you need to.


Warm regards,
Nori


Re: Aws

Posted by Hiroyuki Nagata <id...@gmail.com>.
Hi,
thank you, Pedro.

I tested the maximizeResourceAllocation option. When it's enabled, Spark
seems to utilize the cores fully. However, the performance is not much
different from the default setting.

I'm considering using S3DistCp for uploading files, and I think
table (DataFrame) caching is also effective.
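
For the caching part, a minimal sketch of what I mean (the S3 path is
hypothetical, and "spark" is an existing SparkSession):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Cache a DataFrame that is reused by several actions, so the input
    // is read from S3 only once.
    Dataset<Row> df = spark.read().parquet("s3://example-bucket/input/");
    df.cache();
    df.count();  // the first action materializes the cache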

Regards,
Hiroyuki
