You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/12/04 15:30:40 UTC

[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/19878

    [SPARK-22682][SQL] HashExpression does not need to create global variables

    ## What changes were proposed in this pull request?
    
    It turns out that `HashExpression` can pass around some values via parameter when splitting codes into methods, to save some global variable slots.
    
    This can also prevent a weird case that global variable appears in parameter list, which is discovered by https://github.com/apache/spark/pull/19865
    
    ## How was this patch tested?
    
    existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19878.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19878
    
----
commit 0e9998e0704b54d8f1352a1936c9b6367ebee15e
Author: Wenchen Fan <we...@databricks.com>
Date:   2017-12-04T15:24:46Z

    HashExpression does not need to create global variables

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154805001
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         ev.isNull = "false"
    -    val childrenHash = ctx.splitExpressions(children.map { child =>
    +
    +    val childrenHash = children.map { child =>
           val childGen = child.genCode(ctx)
           childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
             computeHash(childGen.value, child.dataType, ev.value, ctx)
           }
    -    })
    +    }
    +
    +    val hashResultType = ctx.javaType(dataType)
    +    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
    --- End diff --
    
    That one has been merged, but this one is still different. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    **[Test build #84431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84431/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154845901
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    -    val localResult = ctx.freshName("localResult")
         val childResult = ctx.freshName("childResult")
    -    fields.zipWithIndex.map { case (field, index) =>
    +    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
    +      val computeFieldHash = nullSafeElementHash(
    +        input, index.toString, field.nullable, field.dataType, childResult, ctx)
           s"""
    -         $childResult = 0;
    -         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
    -           childResult, ctx)}
    -         $localResult = (31 * $localResult) + $childResult;
    -       """
    -    }.mkString(
    -      s"""
    -         int $localResult = 0;
    -         int $childResult = 0;
    -       """,
    -      "",
    -      s"$result = (31 * $result) + $localResult;"
    -    )
    +         |$childResult = 0;
    +         |$computeFieldHash
    +         |$result = (31 * $result) + $childResult;
    +       """.stripMargin
    +    }
    +
    +    s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
    --- End diff --
    
    Yea, the input here is a row that may be produced by `row.getStruct` instead of `ctx.INPUT_ROW`, so we don't need this check as the input won't be `ctx.currentVars`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154683716
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         ev.isNull = "false"
    -    val childrenHash = ctx.splitExpressions(children.map { child =>
    +
    +    val childrenHash = children.map { child =>
           val childGen = child.genCode(ctx)
           childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
             computeHash(childGen.value, child.dataType, ev.value, ctx)
           }
    -    })
    +    }
    +
    +    val hashResultType = ctx.javaType(dataType)
    +    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
    --- End diff --
    
    I think @kiszk is doing this


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154819030
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         ev.isNull = "false"
    +
         val childHash = ctx.freshName("childHash")
    -    val childrenHash = ctx.splitExpressions(children.map { child =>
    +    val childrenHash = children.map { child =>
           val childGen = child.genCode(ctx)
           val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
             computeHash(childGen.value, child.dataType, childHash, ctx)
           }
           s"""
              |${childGen.code}
    +         |$childHash = 0;
              |$codeToComputeHash
              |${ev.value} = (31 * ${ev.value}) + $childHash;
    -         |$childHash = 0;
            """.stripMargin
    -    })
    +    }
     
    -    ctx.addMutableState(ctx.javaType(dataType), ev.value)
    -    ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
    -    ev.copy(code = s"""
    -      ${ev.value} = $seed;
    -      $childrenHash""")
    +    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
    +      childrenHash.mkString("\n")
    +    } else {
    +      ctx.splitExpressions(
    +        expressions = childrenHash,
    +        funcName = "computeHash",
    +        arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
    +        returnType = ctx.JAVA_INT,
    +        makeSplitFunction = body =>
    +          s"""
    +             |${ctx.JAVA_INT} $childHash = 0;
    +             |$body
    +             |return ${ev.value};
    +           """.stripMargin,
    +        foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
    +    }
    +
    +    ev.copy(code =
    +      s"""
    +         |${ctx.JAVA_INT} ${ev.value} = $seed;
    +         |${ctx.JAVA_INT} $childHash = 0;
    --- End diff --
    
    nvm, `splitExpressions` could possibly not split expressions if only one block.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    left only one very minor comment, it LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    cc @kiszk @mgaido91 @viirya @gatorsmile 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154819410
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    -    val localResult = ctx.freshName("localResult")
         val childResult = ctx.freshName("childResult")
    -    fields.zipWithIndex.map { case (field, index) =>
    +    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
    +      val computeFieldHash = nullSafeElementHash(
    +        input, index.toString, field.nullable, field.dataType, childResult, ctx)
           s"""
    -         $childResult = 0;
    -         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
    -           childResult, ctx)}
    -         $localResult = (31 * $localResult) + $childResult;
    -       """
    -    }.mkString(
    -      s"""
    -         int $localResult = 0;
    -         int $childResult = 0;
    -       """,
    -      "",
    -      s"$result = (31 * $result) + $localResult;"
    -    )
    +         |$childResult = 0;
    +         |$computeFieldHash
    +         |$result = (31 * $result) + $childResult;
    +       """.stripMargin
    +    }
    +
    +    s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
    --- End diff --
    
    No need to check `ctx.INPUT_ROW == null || ctx.currentVars != null` here?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154682872
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         ev.isNull = "false"
    -    val childrenHash = ctx.splitExpressions(children.map { child =>
    +
    +    val childrenHash = children.map { child =>
           val childGen = child.genCode(ctx)
           childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
             computeHash(childGen.value, child.dataType, ev.value, ctx)
           }
    -    })
    +    }
    +
    +    val hashResultType = ctx.javaType(dataType)
    +    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
    --- End diff --
    
    This pattern appears many times in the code base, we may need to create a `ctx.splitExpressionsWithCurrentInput` for it later.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154687417
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    -    val hashes = fields.zipWithIndex.map { case (field, index) =>
    +    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
           nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
         }
    +    val hashResultType = ctx.javaType(dataType)
    --- End diff --
    
    nit: this is done also in line 281. Can we do this only once? maybe with a `lazy val`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154818350
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
     
       override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
         ev.isNull = "false"
    +
         val childHash = ctx.freshName("childHash")
    -    val childrenHash = ctx.splitExpressions(children.map { child =>
    +    val childrenHash = children.map { child =>
           val childGen = child.genCode(ctx)
           val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
             computeHash(childGen.value, child.dataType, childHash, ctx)
           }
           s"""
              |${childGen.code}
    +         |$childHash = 0;
              |$codeToComputeHash
              |${ev.value} = (31 * ${ev.value}) + $childHash;
    -         |$childHash = 0;
            """.stripMargin
    -    })
    +    }
     
    -    ctx.addMutableState(ctx.javaType(dataType), ev.value)
    -    ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
    -    ev.copy(code = s"""
    -      ${ev.value} = $seed;
    -      $childrenHash""")
    +    val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
    +      childrenHash.mkString("\n")
    +    } else {
    +      ctx.splitExpressions(
    +        expressions = childrenHash,
    +        funcName = "computeHash",
    +        arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
    +        returnType = ctx.JAVA_INT,
    +        makeSplitFunction = body =>
    +          s"""
    +             |${ctx.JAVA_INT} $childHash = 0;
    +             |$body
    +             |return ${ev.value};
    +           """.stripMargin,
    +        foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
    +    }
    +
    +    ev.copy(code =
    +      s"""
    +         |${ctx.JAVA_INT} ${ev.value} = $seed;
    +         |${ctx.JAVA_INT} $childHash = 0;
    --- End diff --
    
    nit: `childHash` is only needed to declare here when we don't split functions.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154683889
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    -    val localResult = ctx.freshName("localResult")
         val childResult = ctx.freshName("childResult")
    -    fields.zipWithIndex.map { case (field, index) =>
    +    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
    +      val computeFieldHash = nullSafeElementHash(
    +        input, index.toString, field.nullable, field.dataType, childResult, ctx)
           s"""
    -         $childResult = 0;
    -         ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
    -           childResult, ctx)}
    -         $localResult = (31 * $localResult) + $childResult;
    -       """
    -    }.mkString(
    --- End diff --
    
    We forgot to split the code for computing hive hash of struct, it's fixed now.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19878
  
    **[Test build #84431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19878


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19878#discussion_r154691937
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
    @@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
           input: String,
           result: String,
           fields: Array[StructField]): String = {
    -    val hashes = fields.zipWithIndex.map { case (field, index) =>
    +    val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
           nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
         }
    +    val hashResultType = ctx.javaType(dataType)
    --- End diff --
    
    `ctx` is only available inside `doGenCode`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org