You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2020/06/10 14:46:00 UTC
[jira] [Commented] (SPARK-31948) expose mapSideCombine in aggByKey/reduceByKey/foldByKey

    [ https://issues.apache.org/jira/browse/SPARK-31948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130754#comment-17130754 ] 

Sean R. Owen commented on SPARK-31948:
--------------------------------------

Do you have any info about the speedup? I don't know enough about the code to say, but I wouldn't guess map-side combining is a bad thing? it would tend to reduce shuffled data, and if there's little to combine, doesn't take much time.

> expose mapSideCombine in aggByKey/reduceByKey/foldByKey
> -------------------------------------------------------
>
>                 Key: SPARK-31948
>                 URL: https://issues.apache.org/jira/browse/SPARK-31948
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: zhengruifeng
>            Priority: Minor
>
> 1. {{aggregateByKey}}, {{reduceByKey}} and  {{foldByKey}} will always perform {{mapSideCombine}};
> However, this can be skiped sometime, specially in ML (RobustScaler):
> {code:java}
> vectors.mapPartitions { iter =>
>   if (iter.hasNext) {
>     val summaries = Array.fill(numFeatures)(
>       new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
>     while (iter.hasNext) {
>       val vec = iter.next
>       vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
>     }
>     Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
>   } else Iterator.empty
> }.reduceByKey { case (s1, s2) => s1.merge(s2) } {code}
>  
> This {{reduceByKey}} in {{RobustScaler}} does not need {{mapSideCombine}} at all, similar places exist in {{KMeans}}, {{GMM}}, etc;
> To my knowledge, we do not need {{mapSideCombine}} if the reduction factor isn't high;
>  
> 2. {{treeAggregate}} and {{treeReduce}} are based on {{foldByKey}},  the {{mapSideCombine}} in the first call of {{foldByKey}} can also be avoided.
>  
> SPARK-772:
> {quote}
> Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.
> {quote}
>  
> So what about:
> 1. exposing mapSideCombine in {{aggByKey}}/{{reduceByKey}}/{{foldByKey}}, so that user can disable unnecessary mapSideCombine
> 2. disabling the {{mapSideCombine}} in the first call of {{foldByKey}} in  {{treeAggregate}} and {{treeReduce}}
> 3. disabling the unnecessary {{mapSideCombine}} in ML;
> Friendly ping [~srowen] [~huaxingao] [~weichenxu123] [~hyukjin.kwon] [~viirya]  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org