You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yucai <gi...@git.apache.org> on 2018/08/10 05:26:07 UTC

[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

GitHub user yucai opened a pull request:

    https://github.com/apache/spark/pull/22066

    [SPARK-25084][SQL] "distribute by" on multiple columns may lead to codegen issue

    ## What changes were proposed in this pull request?
    
    "distribute by" on multiple columns may lead to codegen issue
    
    ## How was this patch tested?
    
    UTs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yucai/spark SPARK-25084

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22066.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22066
    
----
commit 8ee56bbfaacdd64b1712d72650a39939ca3b13f2
Author: yucai <yy...@...>
Date:   2018-08-10T05:19:43Z

    "distribute by" on multiple columns may lead to codegen issue

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94543 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94543/testReport)** for PR 22066 at commit [`8ee56bb`](https://github.com/apache/spark/commit/8ee56bbfaacdd64b1712d72650a39939ca3b13f2).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @LantaoJin I realized the initial way had some issue, so I marked it as WIP to refine and add test. It is different from your original implementation, so I would like to use this one.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    can you add a test first?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan @jerryshao sure, I will do it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan @gatorsmile PR has been ready, kindly help review.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan Jira and 1st is from this one. It is critical to our 2.3 migration.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94557/testReport)** for PR 22066 at commit [`931fa28`](https://github.com/apache/spark/commit/931fa28861f15ef1c31a51787f3bd59f2284de89).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by LantaoJin <gi...@git.apache.org>.
Github user LantaoJin commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Thank you @yucai . New PR #22077 for branch-2.3. Cc: @cloud-fan @jerryshao 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209426886
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -404,21 +404,26 @@ abstract class HashExpression[E] extends Expression {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    +    val tmpInput = ctx.freshName("input")
         val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
    -      nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
    +      nullSafeElementHash(tmpInput, index.toString, field.nullable, field.dataType, result, ctx)
         }
         val hashResultType = CodeGenerator.javaType(dataType)
    -    ctx.splitExpressions(
    +    val code = ctx.splitExpressions(
           expressions = fieldsHash,
           funcName = "computeHashForStruct",
    -      arguments = Seq("InternalRow" -> input, hashResultType -> result),
    +      arguments = Seq("InternalRow" -> tmpInput, hashResultType -> result),
           returnType = hashResultType,
           makeSplitFunction = body =>
             s"""
                |$body
                |return $result;
              """.stripMargin,
           foldFunctions = _.map(funcCall => s"$result = $funcCall;").mkString("\n"))
    +    s"""
    +       |final InternalRow $tmpInput = $input;
    --- End diff --
    
    Yes, very agree, we can improve this in the future.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by LantaoJin <gi...@git.apache.org>.
Github user LantaoJin commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Since you refactor your code copying from #22067 . Would you mind just use that?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94564/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94564 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94564/testReport)** for PR 22066 at commit [`5da6d4d`](https://github.com/apache/spark/commit/5da6d4d437c6eb2bdf7a64c031b7c9281a5a8b83).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209426589
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -404,21 +404,26 @@ abstract class HashExpression[E] extends Expression {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    +    val tmpInput = ctx.freshName("input")
         val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
    -      nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
    +      nullSafeElementHash(tmpInput, index.toString, field.nullable, field.dataType, result, ctx)
         }
         val hashResultType = CodeGenerator.javaType(dataType)
    -    ctx.splitExpressions(
    +    val code = ctx.splitExpressions(
           expressions = fieldsHash,
           funcName = "computeHashForStruct",
    -      arguments = Seq("InternalRow" -> input, hashResultType -> result),
    +      arguments = Seq("InternalRow" -> tmpInput, hashResultType -> result),
           returnType = hashResultType,
           makeSplitFunction = body =>
             s"""
                |$body
                |return $result;
              """.stripMargin,
           foldFunctions = _.map(funcCall => s"$result = $funcCall;").mkString("\n"))
    +    s"""
    +       |final InternalRow $tmpInput = $input;
    --- End diff --
    
    Note: we can avoid creating a new variable if the `input` is already a variable. I think this can be done after we fully adopt the new codegen infra from @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209275505
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -2831,4 +2831,17 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
           }
         }
       }
    +
    +  test("SPARK-25084: 'distribute by' on multiple columns may lead to codegen issue") {
    +    withView("spark_25084") {
    +      val count = 1000
    --- End diff --
    
    sorry, disregard the comment, thanks


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94564 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94564/testReport)** for PR 22066 at commit [`5da6d4d`](https://github.com/apache/spark/commit/5da6d4d437c6eb2bdf7a64c031b7c9281a5a8b83).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94543/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94557/testReport)** for PR 22066 at commit [`931fa28`](https://github.com/apache/spark/commit/931fa28861f15ef1c31a51787f3bd59f2284de89).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by LantaoJin <gi...@git.apache.org>.
Github user LantaoJin commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    I offer other fix way. #22067 
    It doesn't need "input" as a global variable (If distribute by random)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209267827
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -778,21 +783,22 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    +    val tmpInput = ctx.freshName("input")
    --- End diff --
    
    Seems like `HiveHash` cannot be triggered in the normal way. Because Spark uses `Murmur3Hash`.
    But this function does have this issue. You can hack to test in this way.
    In `HashPartitioning`:
    ```
      def partitionIdExpression: Expression = Pmod(new Murmur3Hash(expressions), Literal(numPartitions))
    ```
    to
    ```
      def partitionIdExpression: Expression = Pmod(new HiveHash(expressions), Literal(numPartitions))
    ```
    Then run tests:
    ```
      val df = spark.range(1000)
      val columns = (0 until 400).map{ i => s"id as id$i" }
      val distributeExprs = (0 until 100).map(c => s"id$c").mkString(",")
      df.selectExpr(columns : _*).createTempView("test")
      spark.sql(s"select * from test distribute by ($distributeExprs)").count()
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan , yeah, I will include it in 2.3.2.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    **[Test build #94543 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94543/testReport)** for PR 22066 at commit [`8ee56bb`](https://github.com/apache/spark/commit/8ee56bbfaacdd64b1712d72650a39939ca3b13f2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    thanks, merging to master!
    
    @yucai can you send a new PR for 2.3? We did a lot of codegen changes in 2.4 so it's safer to make sure we pass all the tests in 2.3.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209260954
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -2831,4 +2831,17 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
           }
         }
       }
    +
    +  test("SPARK-25084: 'distribute by' on multiple columns may lead to codegen issue") {
    +    withView("spark_25084") {
    +      val count = 1000
    --- End diff --
    
    nit: we can probably inline in the next line


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [WIP][SPARK-25084][SQL] "distribute by" on multiple colu...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan I am refining and adding tests.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns m...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209265323
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
    @@ -2831,4 +2831,17 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
           }
         }
       }
    +
    +  test("SPARK-25084: 'distribute by' on multiple columns may lead to codegen issue") {
    +    withView("spark_25084") {
    +      val count = 1000
    --- End diff --
    
    How to inline? We still use it in the assert.
    ```
          assert(
            spark.sql(s"select * from spark_25084 distribute by ($distributeExprs)").count()
              === count)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    > @viirya is that effort going on? I can help with the work if you want. Thanks.
    
    @mgaido91 Yeah, I'm still working on it. One of the PRs #21537 is still waiting for review.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94557/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    cc @jerryshao this is a regression for 2.3, we should have it in 2.3.2


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by wangyum <gi...@git.apache.org>.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22066#discussion_r209259163
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -778,21 +783,22 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    +    val tmpInput = ctx.freshName("input")
    --- End diff --
    
    Does Hive have this issue?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by yucai <gi...@git.apache.org>.
Github user yucai commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    @cloud-fan Synced with @LantaoJin he will help port to 2.3 soon and I will review it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #22066: [SPARK-25084][SQL] "distribute by" on multiple co...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22066


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22066: [SPARK-25084][SQL] "distribute by" on multiple columns (...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22066
  
    This PR looks identical to #22067 , which one is the first PR?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org