You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2020/06/10 03:54:00 UTC

[jira] [Created] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey

zhengruifeng created SPARK-31948:
------------------------------------

             Summary: expose mapSideCombine in aggByKey/reduceByKey/foldByKey
                 Key: SPARK-31948
                 URL: https://issues.apache.org/jira/browse/SPARK-31948
             Project: Spark
          Issue Type: Improvement
          Components: ML, Spark Core
    Affects Versions: 3.1.0
            Reporter: zhengruifeng


{{1, aggregateByKey,}} {{reduceByKey}} and  {{foldByKey}} will always perform {{mapSideCombine}};

However, this can be skiped sometime, specially in ML (RobustScaler):
{code:java}
vectors.mapPartitions { iter =>
  if (iter.hasNext) {
    val summaries = Array.fill(numFeatures)(
      new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
    while (iter.hasNext) {
      val vec = iter.next
      vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
    }
    Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
  } else Iterator.empty
}.reduceByKey { case (s1, s2) => s1.merge(s2) } {code}
 

This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all, similar places exist in {{KMeans}}, {{GMM}}, etc;

to my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high;

 

2, {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}},  the {{mapSideCombine in the first call of }}{{foldByKey can also be avoided.}}{{}}

 

{{SPARK-772: "Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC."}}

 

{{So what about:}}

{{1, exposing mapSideCombine in aggByKey/reduceByKey/foldByKey, so that user can disable unnecessary }}{{mapSideCombine}}{{;}}

{{2, disabling the }}{{mapSideCombine in the first call of }}{{foldByKey in }}{{treeAggregate}}{{ and }}{{treeReduce}}{{;}}{{}}

{{3, }}{{disabling the uncessary }}{{mapSideCombine in ML;}}{{}}

 

 

{{friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya] }}

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org