Posted to issues@spark.apache.org by "Adam Roberts (JIRA)" <ji...@apache.org> on 2016/11/02 16:51:58 UTC
[jira] [Created] (SPARK-18231) Optimise SizeEstimator implementation
Adam Roberts created SPARK-18231:
------------------------------------
Summary: Optimise SizeEstimator implementation
Key: SPARK-18231
URL: https://issues.apache.org/jira/browse/SPARK-18231
Project: Spark
Issue Type: Improvement
Affects Versions: 2.0.1, 1.6.2
Reporter: Adam Roberts
The SizeEstimator is used in Spark to determine whether or not we need to spill. Spilling typically has an adverse impact on performance, so it is something we want to minimise.
We can improve the implementation of SizeEstimator in a variety of ways to gain a performance increase and, ultimately, a reduction in footprint by spilling less.
There are two phases involved here:
1) Refactor to use more efficient data structures, to avoid some (expensive) reflection calls, to remove the use of ScalaRunTime.array_apply, to use ThreadLocalRandom, to store an array of field offsets instead of a list of pointer fields, and to improve the performance of the sample method.
2) Add JDK-specific specialisations that use exact object sizes to reduce overestimation for both OpenJDK/Oracle JDK users and IBM Java users. With a more accurate estimator we can spill less (lower footprint, higher performance: we have observed a 15% reduction in RDD sizes, leading to potentially double-digit performance gains on HiBench and micro benchmarks).
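As a rough illustration of the sampling part of phase 1, here is a minimal sketch in Java of estimating the total size of a large array from a random sample, drawing indices with ThreadLocalRandom instead of a shared Random instance. It is not the SizeEstimator code: the class name, the method name, the two thresholds and the sizeOf callback are all hypothetical and are there only to show the idea.

```java
import java.util.BitSet;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.ToLongFunction;

final class SampledSizeSketch {
    // Hypothetical thresholds, chosen for illustration only.
    static final int SAMPLING_THRESHOLD = 400;
    static final int SAMPLE_SIZE = 100;

    /**
     * Estimates the summed size of all elements. Small arrays are measured
     * exactly; large arrays are sampled and the sampled sum is scaled up.
     * sizeOf is a stand-in for whatever per-object estimate the caller has.
     */
    static long estimateElements(Object[] array, ToLongFunction<Object> sizeOf) {
        int n = array.length;
        if (n <= SAMPLING_THRESHOLD) {
            long total = 0L;
            for (Object o : array) {
                total += sizeOf.applyAsLong(o);
            }
            return total;
        }
        // ThreadLocalRandom avoids contention on a shared Random instance.
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        BitSet drawn = new BitSet(n); // remembers indices already sampled
        long sampled = 0L;
        for (int i = 0; i < SAMPLE_SIZE; i++) {
            int idx;
            do {
                idx = rnd.nextInt(n);
            } while (drawn.get(idx)); // redraw until the index is new
            drawn.set(idx);
            sampled += sizeOf.applyAsLong(array[idx]);
        }
        // Scale the sampled sum up to the full array length.
        return (long) (sampled * ((double) n / SAMPLE_SIZE));
    }

    public static void main(String[] args) {
        Object[] data = new Object[1000];
        java.util.Arrays.fill(data, "x");
        // Pretend every element occupies 16 bytes (an assumption for the demo).
        System.out.println(estimateElements(data, o -> 16L));
    }
}
```

The BitSet keeps the sampled indices distinct without allocating a boxed set, which is in the spirit of the "more efficient data structures" point, though the choice here is only one plausible option.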
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)