Posted to dev@spark.apache.org by Adam Roberts <AR...@uk.ibm.com> on 2016/07/08 10:17:09 UTC
Spark 2.0.0 performance; potential large Spark core regression
Hi, we've been testing the performance of Spark 2.0 compared to previous
releases. Unfortunately, there are no Spark 2.0 compatible versions of
HiBench and SparkPerf apart from those I'm working on (see
https://github.com/databricks/spark-perf/issues/108).
With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean
regression with a very small scale factor and so we've generated a couple
of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform.
We will gather a 1.6.2 comparison and increase the scale factor.
Has anybody noticed a similar problem? My changes for SparkPerf and Spark
2.0 are very limited and AFAIK don't interfere with Spark core
functionality, so any feedback on the changes would be much appreciated.
I'd much prefer it if my changes were the problem.
A summary for your convenience follows (this matches what I've mentioned
on the SparkPerf issue above):
1. spark-perf/config/config.py: SCALE_FACTOR=0.05
   No. of workers: 1
   Executors per worker: 1
   Executor memory: 18G
   Driver memory: 8G
   Serializer: kryo
2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options:
   -Xdisableexplicitgc -Xcompressedrefs
Main changes I made for the benchmark itself:
- Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
- MLAlgorithmTests use Vectors.fromML
- For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD,
  not wordStream.foreach
- KVDataTest uses awaitTerminationOrTimeout in a SparkStreamingContext
  instead of awaitTermination
- Trivial: we use compact, not compact.render, for outputting JSON
In Spark 2.0 the top five methods where we spend our time are as follows;
the percentage is how much of the overall processing time was spent in
each method:
1. AppendOnlyMap.changeValue 44%
2. SortShuffleWriter.write 19%
3. SizeTracker.estimateSize 7.5%
4. SizeEstimator.estimate 5.36%
5. Range.foreach 3.6%
and in 1.5.2 the top five methods are:
1. AppendOnlyMap.changeValue 38%
2. ExternalSorter.insertAll 33%
3. Range.foreach 4%
4. SizeEstimator.estimate 2%
5. SizeEstimator.visitSingleObject 2%
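To make the two profiles easier to compare, here is a minimal sketch that lines them up. The percentages are copied from the lists above; the function name share_changes is purely illustrative. Note that a method missing from one top-five list is not necessarily absent from that profile, it just did not make the cut.

```python
# Share of overall processing time per method, copied from the two lists above.
spark_1_5_2 = {
    "AppendOnlyMap.changeValue": 38.0,
    "ExternalSorter.insertAll": 33.0,
    "Range.foreach": 4.0,
    "SizeEstimator.estimate": 2.0,
    "SizeEstimator.visitSingleObject": 2.0,
}
spark_2_0_0 = {
    "AppendOnlyMap.changeValue": 44.0,
    "SortShuffleWriter.write": 19.0,
    "SizeTracker.estimateSize": 7.5,
    "SizeEstimator.estimate": 5.36,
    "Range.foreach": 3.6,
}


def share_changes(old, new):
    """Return (deltas, only_old, only_new).

    deltas holds the percentage-point change for methods present in both
    top-five lists; only_old / only_new list methods seen in just one list.
    """
    common = sorted(set(old) & set(new))
    deltas = {m: round(new[m] - old[m], 2) for m in common}
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    return deltas, only_old, only_new


deltas, only_old, only_new = share_changes(spark_1_5_2, spark_2_0_0)
for method, delta in deltas.items():
    print(f"{method:32s} {delta:+.2f} points")
print("only in 1.5.2 top five:", ", ".join(only_old))
print("only in 2.0.0 top five:", ", ".join(only_new))
```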
I see the following scores; on the left is the test name, followed by
the 1.5.2 time and then the 2.0.0 time:
scheduling throughput: 5.2s vs 7.08s
agg by key: 0.72s vs 1.01s
agg by key int: 0.93s vs 1.19s
agg by key naive: 1.88s vs 2.02s
sort by key: 0.64s vs 0.8s
sort by key int: 0.59s vs 0.64s
scala count: 0.09s vs 0.08s
scala count w fltr: 0.31s vs 0.47s
This is only running the Spark core tests (scheduling throughput through
scala-count-w-fltr, including all in between).
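For anyone who wants to redo the arithmetic, here is a rough sketch of how a geomean slowdown can be derived from the per-test times listed above. The helper names (slowdown_ratios, geomean) are illustrative and not part of SparkPerf. It covers only these eight core tests, so the result need not match the 30% figure quoted earlier, which may come from a different set of runs.

```python
import math

# (test name, 1.5.2 seconds, 2.0.0 seconds), copied from the scores above.
timings = [
    ("scheduling throughput", 5.2, 7.08),
    ("agg by key", 0.72, 1.01),
    ("agg by key int", 0.93, 1.19),
    ("agg by key naive", 1.88, 2.02),
    ("sort by key", 0.64, 0.8),
    ("sort by key int", 0.59, 0.64),
    ("scala count", 0.09, 0.08),
    ("scala count w fltr", 0.31, 0.47),
]


def slowdown_ratios(rows):
    """Map each test name to new_time / old_time (above 1.0 means slower)."""
    return {name: new / old for name, old, new in rows}


def geomean(values):
    """Geometric mean of positive numbers, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = slowdown_ratios(timings)
for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.2f}x")
print(f"geomean slowdown: {geomean(ratios.values()):.3f}x")
```

The geometric mean is used rather than the arithmetic mean so that one very slow or very fast test does not dominate the summary of ratios.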
Cheers,
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Adam Roberts <AR...@uk.ibm.com>.
Ted,
That bug was https://issues.apache.org/jira/browse/SPARK-15822 and was only
found while running a sql-flights application (not with the unit
tests). I don't know if it has anything to do with the regression we're
seeing.
One update is that we see the same ballpark regression for 1.6.2 vs 2.0
with HiBench (large profile, 25g executor memory, 4g driver). Again, we
will be carefully checking how these benchmarks are being run and what
difference the options and configurations can make.
Cheers,
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Ted Yu <yu...@gmail.com>.
bq. we turned it off when fixing a bug
Adam:
Can you refer to the bug JIRA?
Thanks
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Adam Roberts <AR...@uk.ibm.com>.
Thanks Michael, we can give your options a try and aim for a 2.0.0 tuned
vs 2.0.0 default vs 1.6.2 default comparison. For future reference, the
defaults in Spark 2 RC2 look to be:
sql.shuffle.partitions: 200
Tungsten enabled: true
Executor memory: 1 GB (we set to 18 GB)
kryo buffer max: 64mb
WholeStageCodegen: on, I think (we turned it off when fixing a bug)
offHeap.enabled: false
offHeap.size: 0
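As a rough aid for the tuned vs default comparison, the sketch below diffs the defaults listed above against the settings Michael posted in this thread. Mapping the shorthand names to full property keys (and "1 GB" to "1g", "64mb" to "64m") is my assumption, and diff_settings is an illustrative helper, not a Spark API.

```python
# Defaults as listed above, written with full property keys (assumed mapping).
defaults = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.tungsten.enabled": "true",
    "spark.executor.memory": "1g",
    "spark.kryoserializer.buffer.max": "64m",
    "spark.sql.codegen.wholeStage": "true",
    "spark.memory.offHeap.enabled": "false",
    "spark.memory.offHeap.size": "0",
}

# The GraphX job settings posted by Michael in this thread.
tuned = {
    "spark.sql.shuffle.partitions": "1024",
    "spark.sql.tungsten.enabled": "true",
    "spark.executor.memory": "24g",
    "spark.kryoserializer.buffer.max": "1g",
    "spark.sql.codegen.wholeStage": "true",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "25769803776",
}


def diff_settings(base, other):
    """Return {key: (base_value, other_value)} for keys whose values differ."""
    return {k: (base.get(k), v) for k, v in other.items() if base.get(k) != v}


changed = diff_settings(defaults, tuned)
for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")
unchanged = sorted(set(tuned) - set(changed))
print("same as default:", ", ".join(unchanged))
```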
Cheers,
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Michael Allman <mi...@videoamp.com>.
Here are some settings we use for some very large GraphX jobs. These are based on using EC2 c3.8xl workers:
import org.apache.spark.SparkConf

// Settings applied on a SparkConf before the context is created
val conf = new SparkConf()
  .set("spark.sql.shuffle.partitions", "1024")
  .set("spark.sql.tungsten.enabled", "true")
  .set("spark.executor.memory", "24g")
  .set("spark.kryoserializer.buffer.max", "1g")
  .set("spark.sql.codegen.wholeStage", "true")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "25769803776") // 24 GB, given in bytes
Some of these are in fact default configurations. Some are not.
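The sizes above mix suffixed strings ("24g", "1g") with a raw byte count, so a quick sanity check of the arithmetic may help. This is only a sketch: to_bytes is a hypothetical helper written for this note, not something from Spark, and it assumes the suffixes are binary units (1g = 1024^3 bytes).

```python
# Binary multipliers for the size suffixes used in the settings above.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def to_bytes(size):
    """Convert a size string such as "24g" or "25769803776" to bytes.

    A trailing k/m/g/t is treated as a binary unit; a bare number is
    taken to be a byte count already.
    """
    s = size.strip().lower()
    if s and s[-1] in UNITS:
        return int(s[:-1]) * UNITS[s[-1]]
    return int(s)


heap = to_bytes("24g")               # spark.executor.memory
off_heap = to_bytes("25769803776")   # spark.memory.offHeap.size
kryo_max = to_bytes("1g")            # spark.kryoserializer.buffer.max

print("off-heap equals 24g:", off_heap == heap)
print("heap + off-heap per executor (GiB):", (heap + off_heap) / 1024**3)
print("kryo buffer max (bytes):", kryo_max)
```

The point of the check is that heap and off-heap memory are sized separately, so with these settings each executor is asking for the two amounts combined.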
Michael
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Michael Allman <mi...@videoamp.com>.
Hi Adam,
From our experience we've found the default Spark 2.0 configuration to be highly suboptimal. I don't know if this affects your benchmarks, but I would consider running some tests with tuned and alternate configurations.
Michael
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Adam Roberts <AR...@uk.ibm.com>.
Hi Michael, the two Spark configuration files aren't very exciting
spark-env.sh
Same as the template apart from a JAVA_HOME setting
spark-defaults.conf
spark.io.compression.codec lzf
config.py has the Spark home set, is running Spark standalone mode, we run
and prep Spark tests only, driver 8g, executor memory 16g, Kryo, 0.66
memory fraction, 100 trials
We can post the 1.6.2 comparison early next week; we'll be running lots of
iterations over the weekend once we get the dedicated time again.
Cheers,
From: Michael Allman <mi...@videoamp.com>
To: Adam Roberts/UK/IBM@IBMGB
Cc: dev <de...@spark.apache.org>
Date: 08/07/2016 16:44
Subject: Re: Spark 2.0.0 performance; potential large Spark core
regression
Hi Adam,
Do you have your spark confs and your spark-env.sh somewhere where we can
see them? If not, can you make them available?
Cheers,
Michael
Re: Spark 2.0.0 performance; potential large Spark core regression
Posted by Michael Allman <mi...@videoamp.com>.
Hi Adam,
Do you have your spark confs and your spark-env.sh somewhere where we can see them? If not, can you make them available?
Cheers,
Michael
> On Jul 8, 2016, at 3:17 AM, Adam Roberts <ar...@uk.ibm.com> wrote:
>
> Hi, we've been testing the performance of Spark 2.0 compared to previous releases. Unfortunately there are no Spark 2.0 compatible versions of HiBench and SparkPerf apart from those I'm working on (see https://github.com/databricks/spark-perf/issues/108)
>
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression with a very small scale factor and so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0. Same JDK version and same platform. We will gather a 1.6.2 comparison and increase the scale factor.
>
> Has anybody noticed a similar problem? My changes for SparkPerf and Spark 2.0 are very limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes would be much appreciated; I'd much prefer it if my changes were the problem.
>
> A summary for your convenience follows (this matches what I've mentioned on the SparkPerf issue above)
>
> 1. spark-perf/config/config.py : SCALE_FACTOR=0.05
> No. Of Workers: 1
> Executor per Worker : 1
> Executor Memory: 18G
> Driver Memory : 8G
> Serializer: kryo
>
> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java Options: -Xdisableexplicitgc -Xcompressedrefs
>
> Main changes I made for the benchmark itself
> Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
> MLAlgorithmTests use Vectors.fromML
> For streaming-tests in HdfsRecoveryTest we use wordStream.foreachRDD not wordStream.foreach
> KVDataTest uses awaitTerminationOrTimeout on the StreamingContext instead of awaitTermination
> Trivial: we use compact not compact.render for outputting JSON
>
> In Spark 2.0 the top five methods where we spend our time are as follows; the percentage is the share of overall processing time spent in that method:
> 1. AppendOnlyMap.changeValue 44%
> 2. SortShuffleWriter.write 19%
> 3. SizeTracker.estimateSize 7.5%
> 4. SizeEstimator.estimate 5.36%
> 5. Range.foreach 3.6%
>
> and in 1.5.2 the top five methods are:
> 1. AppendOnlyMap.changeValue 38%
> 2. ExternalSorter.insertAll 33%
> 3. Range.foreach 4%
> 4. SizeEstimator.estimate 2%
> 5. SizeEstimator.visitSingleObject 2%
>
> I see the following scores: the test name is on the left, followed by the 1.5.2 time and then the 2.0.0 time
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
>
> This is only running the Spark core tests (scheduling throughput through scala-count-w-fltr, including everything in between).
>
> Cheers,
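
To make the per-test numbers quoted above easier to compare, here is a small sketch that turns them into 2.0.0 / 1.5.2 slowdown ratios and a geometric mean. The timings are copied from the message; the `slowdown_ratios` and `geomean` helpers are hypothetical names, and the "2.02" entry is assumed to be in seconds like the others. The result need not match the 30% figure mentioned in the thread, which may have been computed over a different set of runs.

```python
import math

# (test name, 1.5.2 seconds, 2.0.0 seconds), copied from the quoted message.
timings = [
    ("scheduling throughput", 5.2, 7.08),
    ("agg by key", 0.72, 1.01),
    ("agg by key int", 0.93, 1.19),
    ("agg by key naive", 1.88, 2.02),  # assumed to be seconds
    ("sort by key", 0.64, 0.8),
    ("sort by key int", 0.59, 0.64),
    ("scala count", 0.09, 0.08),
    ("scala count w fltr", 0.31, 0.47),
]


def slowdown_ratios(rows):
    """Map each test name to new_time / old_time (>1 means 2.0.0 is slower)."""
    return {name: new / old for name, old, new in rows}


def geomean(values):
    """Geometric mean, computed via the mean of logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


ratios = slowdown_ratios(timings)
for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.2f}x")
print(f"geomean slowdown: {geomean(ratios.values()):.2f}x")
```

A geometric mean is used rather than an arithmetic one so that a test which halves in speed and one which doubles cancel out, which is the usual convention for summarising benchmark ratios.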