Posted to issues@spark.apache.org by "Adam Roberts (JIRA)" <ji...@apache.org> on 2016/11/02 16:51:58 UTC

[jira] [Created] (SPARK-18231) Optimise SizeEstimator implementation

Adam Roberts created SPARK-18231:
------------------------------------

             Summary: Optimise SizeEstimator implementation
                 Key: SPARK-18231
                 URL: https://issues.apache.org/jira/browse/SPARK-18231
             Project: Spark
          Issue Type: Improvement
    Affects Versions: 2.0.1, 1.6.2
            Reporter: Adam Roberts


The SizeEstimator is used in Spark to determine whether or not we need to spill; spilling typically has an adverse impact on performance, so it's something we want to minimise

We can improve the implementation of SizeEstimator in a variety of ways to gain a performance increase and ultimately a reduction in footprint by spilling less

There are two phases involved here:

1) Refactor to use more efficient data structures: avoid expensive reflection calls, remove the use of ScalaRunTime.array_apply, use ThreadLocalRandom, store an array of field offsets instead of a list of pointer fields, and improve the performance of the sample method
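Two of the refactorings above can be sketched together: sampling with ThreadLocalRandom (no contention on a shared Random instance) and scaling the sampled sizes up to the full array. This is a hypothetical illustration, not the actual SizeEstimator code; the class and method names are invented for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

public class SampleSketch {
    // Hypothetical sketch of the "sample" idea: rather than sizing every
    // element of a large array, size up to `k` randomly chosen elements and
    // scale the sample mean up to the full length. ThreadLocalRandom avoids
    // the contended shared Random that a naive implementation would use.
    static long sampledSizeEstimate(long[] elementSizes, int k) {
        int n = elementSizes.length;
        if (n <= k) {
            // Small array: just sum every element exactly.
            long total = 0;
            for (long s : elementSizes) total += s;
            return total;
        }
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long sampleSum = 0;
        for (int i = 0; i < k; i++) {
            sampleSum += elementSizes[rnd.nextInt(n)];
        }
        // Scale the sample mean up to the full array length.
        return (long) ((double) sampleSum / k * n);
    }
}
```

When all elements are the same size the estimate is exact regardless of which indices are sampled, which makes the scaling step easy to check.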

2) Add JDK specialisms that use exact object sizes, reducing overestimation for both Open/Oracle JDK users and IBM Java users. With a more accurate estimator we can spill less (lower footprint, higher performance); we have observed a 15% reduction in RDD sizes, leading to potentially double-digit performance gains on HiBench and micro benchmarks
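The overestimation issue comes from the estimator's layout assumptions differing from what the JVM actually does. A minimal sketch of how such an estimate is built, assuming a 64-bit HotSpot JVM with compressed oops (12-byte object header, fields packed, total size rounded up to an 8-byte boundary); the constants and helper names here are illustrative assumptions, not Spark's actual code:

```java
public class ExactSizeSketch {
    // Assumption for this sketch: 12-byte object header, as on a 64-bit
    // HotSpot JVM with compressed oops. Other JDKs/configurations differ,
    // which is why per-JDK specialisms matter.
    static final long HEADER_BYTES = 12;

    // Round a size up to the JVM's 8-byte object alignment.
    static long alignUp(long size) {
        return (size + 7) & ~7L;
    }

    // Estimate the shallow size of an instance with `fieldBytes` bytes of
    // instance fields: header plus fields, padded to the alignment boundary.
    static long estimate(long fieldBytes) {
        return alignUp(HEADER_BYTES + fieldBytes);
    }
}
```

An estimator that hard-codes a 16-byte header (no compressed oops) would report 24 bytes for an object holding a single int, where the layout above gives 16; summed over millions of RDD elements, that kind of gap is what drives the observed overestimation and unnecessary spilling.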



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org