Posted to issues@spark.apache.org by "Matei Zaharia (JIRA)" <ji...@apache.org> on 2014/06/06 03:12:01 UTC

[jira] [Created] (SPARK-2048) Optimizations to CPU usage of external spilling code

Matei Zaharia created SPARK-2048:
------------------------------------

             Summary: Optimizations to CPU usage of external spilling code
                 Key: SPARK-2048
                 URL: https://issues.apache.org/jira/browse/SPARK-2048
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Matei Zaharia


In the external spilling code in ExternalAppendOnlyMap and CoGroupedRDD, there are a few opportunities for optimization:
- There are lots of uses of pattern-matching on Tuple2 (e.g. val (k, v) = pair), which we found to be much slower than accessing fields directly
- Hash codes for each element are computed many times in StreamBuffer.minKeyHash, which will be expensive for some data types
- Uses of buffer.remove() may be expensive if there are lots of hash collisions, since removing from the middle of a buffer shifts every later element (better to swap the last element into that position)
- More objects are allocated than is probably necessary, e.g. ArrayBuffers and pairs
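
As a rough illustration of the buffer.remove() point, here is a minimal sketch of a swap-based removal. The object and method names (SwapRemove, removeFromBuffer) are hypothetical and are not taken from the Spark source; the sketch also assumes that the order of elements in the buffer does not matter to the caller.

```scala
import scala.collection.mutable.ArrayBuffer

object SwapRemove {
  /** Removes and returns the element at `index` without shifting the tail.
    * The last element is copied into the vacated slot and the last slot is
    * then dropped, so element order is NOT preserved.
    * (Hypothetical helper for illustration only.)
    */
  def removeFromBuffer[T](buffer: ArrayBuffer[T], index: Int): T = {
    val elem = buffer(index)
    val lastIndex = buffer.length - 1
    buffer(index) = buffer(lastIndex) // overwrite the hole with the last element
    buffer.remove(lastIndex)          // removing the final slot shifts nothing
    elem
  }
}
```

For example, removing index 1 from ArrayBuffer(10, 20, 30, 40) returns 20 and leaves ArrayBuffer(10, 40, 30). The trade-off is that the buffer is reordered, which is only acceptable when the caller treats it as an unordered bag.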

These should help because spilling tends to happen in exactly the situations where there is presumably already a lot of GC pressure in the new generation.
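
To make the Tuple2 and hash code items concrete, here is a small sketch under stated assumptions. SpillSketch, CachedHashBuffer and sumValues are hypothetical names invented for this example; this is not how StreamBuffer is implemented in Spark. It only shows the two patterns: reading pair._1 / pair._2 directly instead of destructuring with val (k, v) = pair, and computing a minimum key hash once and storing it rather than recomputing it on every call.

```scala
object SpillSketch {
  // Hash of a key, treating null as 0 (an assumption made for this sketch).
  private def keyHash(k: Any): Int = if (k == null) 0 else k.hashCode()

  /** Hypothetical buffer of (key, value) pairs that caches its minimum key hash.
    * The hash of each key is computed exactly once, at construction time.
    * For an empty array the cached value is Int.MaxValue.
    */
  final class CachedHashBuffer[K, V](val pairs: Array[(K, V)]) {
    val minKeyHash: Int = {
      var min = Int.MaxValue
      var i = 0
      while (i < pairs.length) {
        val h = keyHash(pairs(i)._1) // direct field access, no pattern match
        if (h < min) min = h
        i += 1
      }
      min
    }
  }

  /** Sums the values using pair._2 rather than `val (k, v) = pair`. */
  def sumValues(pairs: Array[(String, Int)]): Long = {
    var total = 0L
    var i = 0
    while (i < pairs.length) {
      total += pairs(i)._2
      i += 1
    }
    total
  }
}
```

Whether the destructuring form actually allocates or runs slower depends on the Scala compiler version and on JIT optimizations, so any change along these lines should be confirmed with a benchmark on the real workload.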



--
This message was sent by Atlassian JIRA
(v6.2#6252)