Posted to issues@spark.apache.org by "zhengruifeng (Jira)" <ji...@apache.org> on 2020/06/10 03:54:00 UTC
[jira] [Created] (SPARK-31948) expose mapSideCombine in
aggByKey/reduceByKey/foldByKey
zhengruifeng created SPARK-31948:
------------------------------------
Summary: expose mapSideCombine in aggByKey/reduceByKey/foldByKey
Key: SPARK-31948
URL: https://issues.apache.org/jira/browse/SPARK-31948
Project: Spark
Issue Type: Improvement
Components: ML, Spark Core
Affects Versions: 3.1.0
Reporter: zhengruifeng
1, {{aggregateByKey}}, {{reduceByKey}} and {{foldByKey}} always perform {{mapSideCombine}};
However, this can sometimes be skipped, especially in ML ({{RobustScaler}}):
{code:java}
vectors.mapPartitions { iter =>
  if (iter.hasNext) {
    val summaries = Array.fill(numFeatures)(
      new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
    while (iter.hasNext) {
      val vec = iter.next
      vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
    }
    Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
  } else Iterator.empty
}.reduceByKey { case (s1, s2) => s1.merge(s2) } {code}
This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all, because each partition already emits exactly one summary per feature index; similar places exist in {{KMeans}}, {{GMM}}, etc.
To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high;
2, {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}}; the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
SPARK-772: "Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC."
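To make the "reduction factor" argument concrete, here is a toy model in plain Scala (no Spark involved). Each inner {{Seq}} stands for one map-side partition, and {{shuffledRecords}} counts the records that would cross the shuffle boundary. The object and method names are made up for this illustration and are not Spark APIs; it is only a sketch of the reasoning, not a measurement.

```scala
object MapSideCombineModel {
  // Toy model: each inner Seq is one map-side partition of (key, value) records.
  // Returns how many records would be sent across the shuffle.
  def shuffledRecords[K, V](partitions: Seq[Seq[(K, V)]], mapSideCombine: Boolean): Int =
    if (mapSideCombine) partitions.map(p => p.map(_._1).distinct.size).sum // one record per distinct key per partition
    else partitions.map(_.size).sum                                        // every record is shuffled as-is

  // Reduces values per key, optionally pre-combining inside each partition first.
  def reduceByKey[K, V](partitions: Seq[Seq[(K, V)]], mapSideCombine: Boolean)(f: (V, V) => V): Map[K, V] = {
    def combine(records: Seq[(K, V)]): Map[K, V] =
      records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

    val shuffled: Seq[(K, V)] =
      if (mapSideCombine) partitions.flatMap(p => combine(p).toSeq)
      else partitions.flatten
    combine(shuffled) // reduce side
  }

  def main(args: Array[String]): Unit = {
    // RobustScaler-like input: every partition emits exactly one record per feature index.
    val perFeature = Seq.fill(3)(Seq.tabulate(4)(i => (i, 1L)))
    println(shuffledRecords(perFeature, mapSideCombine = true))
    println(shuffledRecords(perFeature, mapSideCombine = false))

    // Input with many repeated keys per partition: here the combine step does reduce the shuffle.
    val repeated = Seq(Seq("a" -> 1, "a" -> 2, "a" -> 3, "b" -> 4), Seq("a" -> 5, "b" -> 6, "b" -> 7))
    println(shuffledRecords(repeated, mapSideCombine = true))
    println(shuffledRecords(repeated, mapSideCombine = false))
  }
}
```

In the first input the keys are already unique within each partition, so the combine step shuffles the same number of records either way and only adds work on the map side. In the second input it cuts the shuffled record count, which is the case where {{mapSideCombine}} pays off.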
So what about:
1, exposing {{mapSideCombine}} in {{aggregateByKey}}/{{reduceByKey}}/{{foldByKey}}, so that users can disable unnecessary {{mapSideCombine}};
2, disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in {{treeAggregate}} and {{treeReduce}};
3, disabling the unnecessary {{mapSideCombine}} in ML;
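For reference, {{combineByKeyWithClassTag}} already takes a {{mapSideCombine}} argument, so something close to point 1 can be approximated today without a new API. The sketch below assumes that route; {{NoMapSideCombine}} and {{reduceByKeyNoCombine}} are hypothetical names chosen for illustration, not existing Spark methods and not necessarily how this issue would be resolved.

```scala
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object NoMapSideCombine {
  // Hypothetical helper (not part of Spark): behaves like reduceByKey,
  // but asks the shuffle to skip the map-side combine step.
  def reduceByKeyNoCombine[K: ClassTag, V: ClassTag](
      rdd: RDD[(K, V)],
      numPartitions: Int)(f: (V, V) => V): RDD[(K, V)] =
    rdd.combineByKeyWithClassTag[V](
      (v: V) => v,                        // createCombiner: the first value is the combiner
      f,                                  // mergeValue
      f,                                  // mergeCombiners
      new HashPartitioner(numPartitions), // assumption: plain hash partitioning is acceptable
      mapSideCombine = false)
}
```

With this helper, the {{RobustScaler}} snippet could end in {{NoMapSideCombine.reduceByKeyNoCombine(summariesRdd, n)(_.merge(_))}} instead of {{.reduceByKey}}, where {{summariesRdd}} and {{n}} are placeholders for the mapped RDD and the desired partition count.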
friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]