You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2017/12/04 15:30:40 UTC
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19878
[SPARK-22682][SQL] HashExpression does not need to create global variables
## What changes were proposed in this pull request?
It turns out that `HashExpression` can pass around some values via parameter when splitting codes into methods, to save some global variable slots.
This can also prevent a weird case that global variable appears in parameter list, which is discovered by https://github.com/apache/spark/pull/19865
## How was this patch tested?
existing tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark minor
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19878.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19878
----
commit 0e9998e0704b54d8f1352a1936c9b6367ebee15e
Author: Wenchen Fan <we...@databricks.com>
Date: 2017-12-04T15:24:46Z
HashExpression does not need to create global variables
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by gatorsmile <gi...@git.apache.org>.
Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154805001
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ev.isNull = "false"
- val childrenHash = ctx.splitExpressions(children.map { child =>
+
+ val childrenHash = children.map { child =>
val childGen = child.genCode(ctx)
childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
computeHash(childGen.value, child.dataType, ev.value, ctx)
}
- })
+ }
+
+ val hashResultType = ctx.javaType(dataType)
+ val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --
That one has been merged, but this one is still different.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19878
**[Test build #84431 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19878
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84431/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154845901
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
input: String,
result: String,
fields: Array[StructField]): String = {
- val localResult = ctx.freshName("localResult")
val childResult = ctx.freshName("childResult")
- fields.zipWithIndex.map { case (field, index) =>
+ val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+ val computeFieldHash = nullSafeElementHash(
+ input, index.toString, field.nullable, field.dataType, childResult, ctx)
s"""
- $childResult = 0;
- ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
- childResult, ctx)}
- $localResult = (31 * $localResult) + $childResult;
- """
- }.mkString(
- s"""
- int $localResult = 0;
- int $childResult = 0;
- """,
- "",
- s"$result = (31 * $result) + $localResult;"
- )
+ |$childResult = 0;
+ |$computeFieldHash
+ |$result = (31 * $result) + $childResult;
+ """.stripMargin
+ }
+
+ s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
--- End diff --
Yea, the input here is a row that may be produced by `row.getStruct` instead of `ctx.INPUT_ROW`, so we don't need this check as the input won't be `ctx.currentVars`.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154683716
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ev.isNull = "false"
- val childrenHash = ctx.splitExpressions(children.map { child =>
+
+ val childrenHash = children.map { child =>
val childGen = child.genCode(ctx)
childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
computeHash(childGen.value, child.dataType, ev.value, ctx)
}
- })
+ }
+
+ val hashResultType = ctx.javaType(dataType)
+ val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --
I think @kiszk is doing this
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154819030
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ev.isNull = "false"
+
val childHash = ctx.freshName("childHash")
- val childrenHash = ctx.splitExpressions(children.map { child =>
+ val childrenHash = children.map { child =>
val childGen = child.genCode(ctx)
val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
computeHash(childGen.value, child.dataType, childHash, ctx)
}
s"""
|${childGen.code}
+ |$childHash = 0;
|$codeToComputeHash
|${ev.value} = (31 * ${ev.value}) + $childHash;
- |$childHash = 0;
""".stripMargin
- })
+ }
- ctx.addMutableState(ctx.javaType(dataType), ev.value)
- ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
- ev.copy(code = s"""
- ${ev.value} = $seed;
- $childrenHash""")
+ val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+ childrenHash.mkString("\n")
+ } else {
+ ctx.splitExpressions(
+ expressions = childrenHash,
+ funcName = "computeHash",
+ arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
+ returnType = ctx.JAVA_INT,
+ makeSplitFunction = body =>
+ s"""
+ |${ctx.JAVA_INT} $childHash = 0;
+ |$body
+ |return ${ev.value};
+ """.stripMargin,
+ foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
+ }
+
+ ev.copy(code =
+ s"""
+ |${ctx.JAVA_INT} ${ev.value} = $seed;
+ |${ctx.JAVA_INT} $childHash = 0;
--- End diff --
nvm, `splitExpressions` could possibly not split expressions if only one block.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19878
left only one very minor comment, it LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19878
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:
https://github.com/apache/spark/pull/19878
LGTM
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19878
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19878
cc @kiszk @mgaido91 @viirya @gatorsmile
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154819410
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
input: String,
result: String,
fields: Array[StructField]): String = {
- val localResult = ctx.freshName("localResult")
val childResult = ctx.freshName("childResult")
- fields.zipWithIndex.map { case (field, index) =>
+ val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+ val computeFieldHash = nullSafeElementHash(
+ input, index.toString, field.nullable, field.dataType, childResult, ctx)
s"""
- $childResult = 0;
- ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
- childResult, ctx)}
- $localResult = (31 * $localResult) + $childResult;
- """
- }.mkString(
- s"""
- int $localResult = 0;
- int $childResult = 0;
- """,
- "",
- s"$result = (31 * $result) + $localResult;"
- )
+ |$childResult = 0;
+ |$computeFieldHash
+ |$result = (31 * $result) + $childResult;
+ """.stripMargin
+ }
+
+ s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
--- End diff --
No need to check `ctx.INPUT_ROW == null || ctx.currentVars != null` here?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154682872
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -270,17 +270,36 @@ abstract class HashExpression[E] extends Expression {
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ev.isNull = "false"
- val childrenHash = ctx.splitExpressions(children.map { child =>
+
+ val childrenHash = children.map { child =>
val childGen = child.genCode(ctx)
childGen.code + ctx.nullSafeExec(child.nullable, childGen.isNull) {
computeHash(childGen.value, child.dataType, ev.value, ctx)
}
- })
+ }
+
+ val hashResultType = ctx.javaType(dataType)
+ val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
--- End diff --
This pattern appears many times in the code base, we may need to create a `ctx.splitExpressionsWithCurrentInput` for it later.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by mgaido91 <gi...@git.apache.org>.
Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154687417
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
input: String,
result: String,
fields: Array[StructField]): String = {
- val hashes = fields.zipWithIndex.map { case (field, index) =>
+ val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
}
+ val hashResultType = ctx.javaType(dataType)
--- End diff --
nit: this is done also in line 281. Can we do this only once? maybe with a `lazy val`?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154818350
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -610,25 +637,44 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
ev.isNull = "false"
+
val childHash = ctx.freshName("childHash")
- val childrenHash = ctx.splitExpressions(children.map { child =>
+ val childrenHash = children.map { child =>
val childGen = child.genCode(ctx)
val codeToComputeHash = ctx.nullSafeExec(child.nullable, childGen.isNull) {
computeHash(childGen.value, child.dataType, childHash, ctx)
}
s"""
|${childGen.code}
+ |$childHash = 0;
|$codeToComputeHash
|${ev.value} = (31 * ${ev.value}) + $childHash;
- |$childHash = 0;
""".stripMargin
- })
+ }
- ctx.addMutableState(ctx.javaType(dataType), ev.value)
- ctx.addMutableState(ctx.JAVA_INT, childHash, s"$childHash = 0;")
- ev.copy(code = s"""
- ${ev.value} = $seed;
- $childrenHash""")
+ val codes = if (ctx.INPUT_ROW == null || ctx.currentVars != null) {
+ childrenHash.mkString("\n")
+ } else {
+ ctx.splitExpressions(
+ expressions = childrenHash,
+ funcName = "computeHash",
+ arguments = Seq("InternalRow" -> ctx.INPUT_ROW, ctx.JAVA_INT -> ev.value),
+ returnType = ctx.JAVA_INT,
+ makeSplitFunction = body =>
+ s"""
+ |${ctx.JAVA_INT} $childHash = 0;
+ |$body
+ |return ${ev.value};
+ """.stripMargin,
+ foldFunctions = _.map(funcCall => s"${ev.value} = $funcCall;").mkString("\n"))
+ }
+
+ ev.copy(code =
+ s"""
+ |${ctx.JAVA_INT} ${ev.value} = $seed;
+ |${ctx.JAVA_INT} $childHash = 0;
--- End diff --
nit: `childHash` is only needed to declare here when we don't split functions.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154683889
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) extends HashExpression[Int] {
input: String,
result: String,
fields: Array[StructField]): String = {
- val localResult = ctx.freshName("localResult")
val childResult = ctx.freshName("childResult")
- fields.zipWithIndex.map { case (field, index) =>
+ val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+ val computeFieldHash = nullSafeElementHash(
+ input, index.toString, field.nullable, field.dataType, childResult, ctx)
s"""
- $childResult = 0;
- ${nullSafeElementHash(input, index.toString, field.nullable, field.dataType,
- childResult, ctx)}
- $localResult = (31 * $localResult) + $childResult;
- """
- }.mkString(
--- End diff --
We forgot to split the code for computing hive hash of struct, it's fixed now.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19878
**[Test build #84431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84431/testReport)** for PR 19878 at commit [`0e9998e`](https://github.com/apache/spark/commit/0e9998e0704b54d8f1352a1936c9b6367ebee15e).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19878
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19878#discussion_r154691937
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala ---
@@ -389,13 +408,21 @@ abstract class HashExpression[E] extends Expression {
input: String,
result: String,
fields: Array[StructField]): String = {
- val hashes = fields.zipWithIndex.map { case (field, index) =>
+ val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
nullSafeElementHash(input, index.toString, field.nullable, field.dataType, result, ctx)
}
+ val hashResultType = ctx.javaType(dataType)
--- End diff --
`ctx` is only available inside `doGenCode`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org