You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2018/06/07 03:41:32 UTC

[GitHub] spark pull request #21498: [SPARK-24410][SQL][Core] Optimization for Union o...

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21498#discussion_r193618338
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -1099,6 +1099,17 @@ object SQLConf {
           .intConf
           .createWithDefault(SHUFFLE_SPILL_NUM_ELEMENTS_FORCE_SPILL_THRESHOLD.defaultValue.get)
     
    +  val UNION_IN_SAME_PARTITION =
    +    buildConf("spark.sql.unionInSamePartition")
    +      .internal()
    +      .doc("When true, Union operator will union children results in the same corresponding " +
    +        "partitions if they have same partitioning. This eliminates unnecessary shuffle in later " +
    +        "operators like aggregation. Note that because non-deterministic functions such as " +
    +        "monotonically_increasing_id are depended on partition id. By doing this, the values of " +
    --- End diff --
    
    I'm a bit not convinced by the reason and behavior of keeping the value of non-deterministic functions after an union. Like in the following queries:
    
    ```scala
    val df1 = spark.range(10).select(monotonically_increasing_id())
    val df2 = spark.range(10).select(monotonically_increasing_id())
    val union = df1.union(df2)
    ```
    
    Now we keep the values of `monotonically_increasing_id` returned by `df1`, `df2` and `union` are the same. However, as non-deterministic functions, the values changing by data layout/sequence sounds still reasonable.
    
    Anyway that is current behavior and this config need users to enable this feature explicitly.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org