You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2015/12/22 16:27:01 UTC

[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/10435

    [SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions

    The hash algorithm is based on hive's bucketing hash function, see https://github.com/apache/hive/blob/release-1.2.0/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L421-L423 and https://github.com/apache/hive/blob/release-1.2.0/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L497-L607

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark hash-expr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10435
    
----
commit 53b0ec57947130b56d374708b03cb706ab21c0ae
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-12-22T15:22:59Z

    add hash expression

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167063931
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48587902
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    --- End diff --
    
    If we're going to change this, we should use a different value for null. Pick a large random number instead.
    
    0 will be computed as the hash for more reasonable data (e.g. more likely an int column contains 0 than a large prime) and we can cheaply reduce some collisions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167710215
  
    **[Test build #48388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)** for PR 10435 at commit [`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167403091
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48352/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48778819
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * Internally this function will write arguments into an [[UnsafeRow]], and calculate hash code of
    + * the unsafe row using murmur3 hasher with a seed.
    + * We should use this hash function for both shuffle and bucket, so that we can guarantee shuffle
    + * and bucketing have same data distribution.
    + */
    +case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression {
    +  def this(arguments: Seq[Expression]) = this(arguments, 42)
    --- End diff --
    
    I think this is fine.
    
    Can you file a follow up jira to look at this again? I think we want to remove the projection to unsafe row soon (before we ship this and persist metadata that way). This should be decoupled from unsafe row ideally. For example, if the row is (int, double, string): the generated hash function shoudl be something like
    
    int hash = seed;
    hash = murmur3(getInt(0), hash)
    hash = murmur3(getDouble(1), hash)
    hash = murmur3(getString(2), hash)
    return hash
    
    This is likely not the currently computed hash value so can't defer this for too long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167328998
  
    **[Test build #48346 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)** for PR 10435 at commit [`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167737270
  
    **[Test build #48404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)** for PR 10435 at commit [`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033129
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48554681
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    --- End diff --
    
    Let's also add comments to explain the benefit of this function.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396601
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48349/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167452730
  
    **[Test build #48356 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)** for PR 10435 at commit [`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168856535
  
    **[Test build #48699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)** for PR 10435 at commit [`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033130
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48272/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167932253
  
    **[Test build #48442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)** for PR 10435 at commit [`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167452744
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48705398
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    sounds good to me, let me try it out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167251758
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168876055
  
    **[Test build #48699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48699/consoleFull)** for PR 10435 at commit [`9a978c4`](https://github.com/apache/spark/commit/9a978c411e9799be1f3b76e6ea20dba281efc4e7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166646764
  
    cc @yhuai @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48527967
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    --- End diff --
    
    Looks like in Hive hash value of boolean is 1 for true and 0 for false?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168128675
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167292151
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48337/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167063932
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48291/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48516214
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    @marmbrus  I think we will only use the codegen version, should we remove this branch and throw exception if it's called?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48482588
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    To follow hive, I turn `UTF8String` to `String` first and then call `hashCode`, but I'm a little worried about this:
    
    * this is definitely slower than just `UTF8String.hashCode`, and it's a critical path that we will run it for every row during `Exchange`, will it hurt performance?
    * hive has string type, varchar type and char type, and they have [different hash code](https://github.com/apache/hive/blob/release-1.2.0/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L527-L541), but we only have `StringType`, which is hard to match all of them.
    
    cc @nongli @yhuai @marmbrus @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646183
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -177,3 +179,44 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * Internally this function will write arguments into an [[UnsafeRow]], and calculate hash code of
    + * the unsafe row using murmur3 hasher with a seed.
    + * We should use this hash function for both shuffle and bucket, so that we can guarantee shuffle
    + * and bucketing have same data distribution.
    + */
    +case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression {
    +  def this(arguments: Seq[Expression]) = this(arguments, 42)
    --- End diff --
    
    use 42 as default seed, which is same with `UnsafeRow.hashCode`, should we make `42` a constant variable in `Murmur3_x86_32`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167338506
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48528123
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 37 + elementHash` with an initial value `result = 37`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by the same way of array.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def checkInputDataTypes(): TypeCheckResult = {
    +    if (children.isEmpty) {
    +      TypeCheckResult.TypeCheckFailure("input to function hash cannot be empty")
    +    } else {
    +      TypeCheckResult.TypeCheckSuccess
    +    }
    +  }
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 37
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 37 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 0 else 1
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +
    +    case array: ArrayData =>
    +      val elementType = dataType.asInstanceOf[ArrayType].elementType
    +      var result = 0
    +      var i = 0
    +      while (i < array.numElements()) {
    +        val hashValue = computeHash(array.get(i, elementType), elementType)
    +        result = result * 37 + hashValue
    --- End diff --
    
    37? Looks like Hive uses 31.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167727788
  
    **[Test build #48399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)** for PR 10435 at commit [`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396686
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48350/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167747465
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166901052
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167283801
  
    **[Test build #48337 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48337/consoleFull)** for PR 10435 at commit [`f408f1f`](https://github.com/apache/spark/commit/f408f1ff0e9b4839a48dea56a17ca4755ad8f717).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167583523
  
    need some help about the R test, I can't see the expected result through the error log...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167042309
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168047956
  
    On the contrary I think we should consider not using hash code on an object, and always use hash code expression for two reasons:
    
    1. We still need a hash function
    
    2. We get code gen using an expression
    
    3. It is easier to control (being able to pass a seed or use it for bloom filters)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168614906
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48301376
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    --- End diff --
    
    Is it possible to remove the code duplication with BaseGenericInternalRow.hashCode?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48342878
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
    @@ -114,6 +114,7 @@ private[sql] object StatFunctions extends Logging {
           if (element == null) "null" else element.toString
         }
         // get the distinct values of column 2, so that we can make them the column names
    +    // ???
         val distinctCol2: Map[Any, Int] =
           counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
    --- End diff --
    
    The `counts` is the result of aggregation without sort, so the order of distinct values of column 2 is non-deterministic. This means, when we use `distinctCol2` to generate the field names of output schema, the order of field names is also non-deterministic.
    
    When I changed the hash function in `Exchange`, one crosstable test failed because the field names don't match. I simply updated that test, need someone to confirm this is the right thing to do.
    cc @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167698232
  
    **[Test build #48388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48388/consoleFull)** for PR 10435 at commit [`6311aa7`](https://github.com/apache/spark/commit/6311aa75a7a41fee8464ee96e5949ccad3e7d7a5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48705356
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    maybe we can have a flag to control this -- when in hive compatibility test, fall back to Hive's, and otherwise our own?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033976
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168128676
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48540/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168876234
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48699/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167710233
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48388/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167022431
  
    **[Test build #48274 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48274/consoleFull)** for PR 10435 at commit [`207ca84`](https://github.com/apache/spark/commit/207ca84dca5a607a82ec7ff46a209959a9cfa312).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168801707
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167730352
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167399420
  
    **[Test build #48352 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)** for PR 10435 at commit [`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48370011
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala ---
    @@ -114,6 +114,7 @@ private[sql] object StatFunctions extends Logging {
           if (element == null) "null" else element.toString
         }
         // get the distinct values of column 2, so that we can make them the column names
    +    // ???
         val distinctCol2: Map[Any, Int] =
           counts.map(e => cleanElement(e.get(1))).distinct.zipWithIndex.toMap
    --- End diff --
    
    That's fine. Actually maybe we should also inject a sort to create deterministic ordering.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167251754
  
    **[Test build #48325 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)** for PR 10435 at commit [`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396684
  
    **[Test build #48350 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167063904
  
    **[Test build #48291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48291/consoleFull)** for PR 10435 at commit [`fcb5af9`](https://github.com/apache/spark/commit/fcb5af9b4af4d4885bb42ca11eba4b33a91ccc81).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167710232
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167614804
  
    I see
    ```
    1. Failure (at test_sparkSQL.R#1160): group by, agg functions ------------------
    30 not equal to collect(max(gd))[1, 2]
    30 - 19 == 11
    
    2. Failure (at test_sparkSQL.R#1635): crosstab() on a DataFrame ----------------
    expected is not identical to ordered. Differences: 
    Names: 2 string mismatches
    Component 2: Mean relative difference: 2
    Component 3: Mean relative difference: 2
    Error: Test failures
    Execution halted
    Had test failures; see logs.
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48588221
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 37 + elementHash` with an initial value `result = 37`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by the same way of array.
    + *
    + * This hash algorithm is basically same with `GenericInternalRow.hashCode`, but using this hash
    + * expression is better as it can produce consistent hash values between safe and unsafe data
    + * structure, and can be slightly faster by codegen.
    + * It's also the hash function for both shuffle and bucketing, so that we can guarantee shuffle and
    + * bucketing have same data distribution.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    --- End diff --
    
    What if we just turned this into Mumur3Hash instead?
    
    This would just do UnsafeProjection.create()
    project(input).hashCode()
    
    Murmur3 will give us much nicer hashing properties. The current hash function can be bad in reasonable cases. 
    
    For example, if the long column is a timestamp in milis from a source that samples every second. Most of the low digits will be similar (e.g. values are 1000, 2002, 2999, etc. Very few that end in 500). The hash function does a very bad job of breaking this up and this will generate some very skewed partitions.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48589312
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,229 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 37 + elementHash` with an initial value `result = 37`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by the same way of array.
    + *
    + * This hash algorithm is basically same with `GenericInternalRow.hashCode`, but using this hash
    + * expression is better as it can produce consistent hash values between safe and unsafe data
    + * structure, and can be slightly faster by codegen.
    + * It's also the hash function for both shuffle and bucketing, so that we can guarantee shuffle and
    + * bucketing have same data distribution.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    --- End diff --
    
    good point!
    after decided to not follow hive, I agree Mumur3Hash is a better choice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168134068
  
    **[Test build #48542 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)** for PR 10435 at commit [`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167452745
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48356/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48301323
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    --- End diff --
    
    Which path handles string?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396600
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168700418
  
    I think we should open another PR to use this hash expression in `Exchange`, as it will break a lof of tests and make it harder to review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166888108
  
    **[Test build #48241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull)** for PR 10435 at commit [`79a5738`](https://github.com/apache/spark/commit/79a5738396e3ad0990f6599249135963f458d357).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168072275
  
    It makes sense to still have a Hash expression (called more specifically, Mumur3Hash) that does what this patch originally intended. I think this will be a useful primitive. The underlying implementation can just use UnsafeRow.hashCode for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48500881
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    We have to match Hive's hashcode if we want to be able to join data Hive has bucketed with our own data.
    
    +1 to avoiding toString.  We should also avoid boxing and runtime type reflection for the hash code (which this function is doing).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167055194
  
    **[Test build #48291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48291/consoleFull)** for PR 10435 at commit [`fcb5af9`](https://github.com/apache/spark/commit/fcb5af9b4af4d4885bb42ca11eba4b33a91ccc81).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166691580
  
    Looks like we can also mention https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFHash.java. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168674649
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166692149
  
    Regarding testing this, I am wondering if we can add it to function registry. So, all queries that use `hash` will use this implementation and we can see if there is any failed test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166901014
  
    **[Test build #48241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull)** for PR 10435 at commit [`79a5738`](https://github.com/apache/spark/commit/79a5738396e3ad0990f6599249135963f458d357).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `class JavaWordBlacklist `\n  * `class JavaDroppedWordsCounter `\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48703350
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48241/consoleFull
    
    most of them is testing something else but coincidently include hash expression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646170
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    Here I didn't use `hash` for the name, as it will break a lot of hive compatibility tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48676826
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    can you give me a list? i think we should consider just blacklisting them ...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167329447
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48345/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167713829
  
    Actually I have already created a pr #9883 for this long time ago...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48516332
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    Oh, I see.  Its probably fine to have this as a fallback.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168614140
  
    **[Test build #48644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)** for PR 10435 at commit [`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167251759
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48325/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48537829
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +179,223 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            0 for true, 1 for false.
    --- End diff --
    
    Oh sorry I forgot to update the PR description.
    According to the [discussion](https://github.com/apache/spark/pull/10435#discussion_r48482588), we decided to not follow hive for performance concerns. So here I followed the original hash code of internal row so that we don't need to fix a lot of tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166692702
  
    It is also important to use this hash function in Exchange. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168123375
  
    **[Test build #48540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)** for PR 10435 at commit [`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48869485
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
     
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --
    
    You can use the following form:
    (firstarg:Int)(more:Int*)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168125552
  
    **[Test build #48542 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48542/consoleFull)** for PR 10435 at commit [`b95e64e`](https://github.com/apache/spark/commit/b95e64ee0f539770534fe302321e677e3fad4a7e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48808255
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
     
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --
    
    the hash function should take at least one parameter, is `@scala.annotation.varargs` support this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168876233
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396554
  
    **[Test build #48349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166650029
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48203/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan closed the pull request at:

    https://github.com/apache/spark/pull/10435


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167338507
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48346/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48808153
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
    @@ -1863,6 +1863,17 @@ object functions extends LegacyFunctions {
        */
       def crc32(e: Column): Column = withExpr { Crc32(e.expr) }
     
    +  /**
    +   * Calculates the hash code of given columns, and returns the result as a int column.
    +   *
    +   * @group misc_funcs
    +   * @since 2.0
    +   */
    +  @scala.annotation.varargs
    +  def hash(col: Column, cols: Column*): Column = withExpr {
    --- End diff --
    
    this should just be  a single column vararg, rather than one followed by vararg?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48500689
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    Definitely don't do toString. Any particular reason why we need to match Hive's hash code?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167042286
  
    **[Test build #48281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48281/consoleFull)** for PR 10435 at commit [`e4d7b82`](https://github.com/apache/spark/commit/e4d7b8216b2ce906275c96c0c14d908358d96ada).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167395236
  
    retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166650025
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167042310
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48281/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168134146
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168674438
  
    **[Test build #48656 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)** for PR 10435 at commit [`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168128648
  
    **[Test build #48540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48540/consoleFull)** for PR 10435 at commit [`61783e7`](https://github.com/apache/spark/commit/61783e7bb9ea9c6f622d18db06dc85d71e6443a8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48279846
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    --- End diff --
    
    How about we add more contents at here to explain the algorithm and mentions that we are following Hive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167338499
  
    **[Test build #48346 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48346/consoleFull)** for PR 10435 at commit [`303b69b`](https://github.com/apache/spark/commit/303b69b243fb2ac5f79c948c0008592b8c57fc25).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167329445
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167730331
  
    **[Test build #48399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48399/consoleFull)** for PR 10435 at commit [`84902bb`](https://github.com/apache/spark/commit/84902bb2b9488212e0006bf359135073b8a9e496).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168652917
  
    **[Test build #48656 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48656/consoleFull)** for PR 10435 at commit [`2c1e963`](https://github.com/apache/spark/commit/2c1e963d5d1e0c286f968ae6278785b5d9586b11).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48315403
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    --- End diff --
    
    The last one `case other => other.hashCode()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10435


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168134147
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48542/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167747137
  
    **[Test build #48404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48404/consoleFull)** for PR 10435 at commit [`1cdb2bc`](https://github.com/apache/spark/commit/1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033978
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48274/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167020880
  
    **[Test build #48272 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48272/consoleFull)** for PR 10435 at commit [`f12cbb6`](https://github.com/apache/spark/commit/f12cbb6c4a9727eaaad8f7e8eca8d312ef67c5b8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167403090
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167292083
  
    **[Test build #48337 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48337/consoleFull)** for PR 10435 at commit [`f408f1f`](https://github.com/apache/spark/commit/f408f1ff0e9b4839a48dea56a17ca4755ad8f717).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033952
  
    **[Test build #48274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48274/consoleFull)** for PR 10435 at commit [`207ca84`](https://github.com/apache/spark/commit/207ca84dca5a607a82ec7ff46a209959a9cfa312).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646962
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    35 tests


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168017519
  
    Closing, will open another PR to use `UnsafeRow.hashCode` for shuffle and fix tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167932288
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167403074
  
    **[Test build #48352 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48352/consoleFull)** for PR 10435 at commit [`a629e75`](https://github.com/apache/spark/commit/a629e754be3c087466de9f3b0aa8634e8d640ea0).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48502287
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -176,3 +178,221 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + *
    + * The hash value for an expression depends on its type:
    + *  - null:               0
    + *  - boolean:            1 for true, 0 for false.
    + *  - byte, short, int:   the input itself.
    + *  - long:               input XOR (input >>> 32)
    + *  - float:              java.lang.Float.floatToIntBits(input)
    + *  - double:             l = java.lang.Double.doubleToLongBits(input); l XOR (l >>> 32)
    + *  - binary:             java.util.Arrays.hashCode(input)
    + *  - array:              recursively calculate hash value for each element, and aggregate them by
    + *                        `result = result * 31 + elementHash` with an initial value `result = 0`.
    + *  - map:                recursively calculate hash value for each key-value pair, and aggregate
    + *                        them by `result += keyHash XOR valueHash`.
    + *  - struct:             similar to array, calculate hash value for each field and aggregate them.
    + *  - other type:         input.hashCode().
    + *                        e.g. calculate hash value for string type by `UTF8String.hashCode()`.
    + * Finally we aggregate the hash values for each expression by `result = result * 31 + exprHash`.
    + *
    + * This hash algorithm follows hive's bucketing hash function, so that our bucketing function can
    + * be compatible with hive's, e.g. we can benefit from bucketing even the data source is mixed with
    + * hive tables.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    +
    +  override def dataType: DataType = IntegerType
    +
    +  override def foldable: Boolean = children.forall(_.foldable)
    +
    +  override def nullable: Boolean = false
    +
    +  override def eval(input: InternalRow): Any = {
    +    var result = 0
    +    for (e <- children) {
    +      val hashValue = computeHash(e.eval(input), e.dataType)
    +      result = result * 31 + hashValue
    +    }
    +    result
    +  }
    +
    +  private def computeHash(v: Any, dataType: DataType): Int = v match {
    +    case null => 0
    +    case b: Boolean => if (b) 1 else 0
    +    case b: Byte => b.toInt
    +    case s: Short => s.toInt
    +    case i: Int => i
    +    case l: Long => (l ^ (l >>> 32)).toInt
    +    case f: Float => java.lang.Float.floatToIntBits(f)
    +    case d: Double =>
    +      val b = java.lang.Double.doubleToLongBits(d)
    +      (b ^ (b >>> 32)).toInt
    +    case a: Array[Byte] => java.util.Arrays.hashCode(a)
    +    case s: UTF8String => s.toString.hashCode
    --- End diff --
    
    It seems to me we don't need to follow Hive's hash code, but we should make sure we don't pick the plan if it's a table bucketed by HIve.
    
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167036143
  
    **[Test build #48281 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48281/consoleFull)** for PR 10435 at commit [`e4d7b82`](https://github.com/apache/spark/commit/e4d7b8216b2ce906275c96c0c14d908358d96ada).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48317407
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    +case class Hash(children: Seq[Expression]) extends Expression {
    --- End diff --
    
    I'm afraid we can't, as they are different. e.g. we use `31` instead of `37` here, we use `1` for boolean `true` instead of `0`here, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166901054
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48241/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396685
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167918200
  
    **[Test build #48442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48442/consoleFull)** for PR 10435 at commit [`8703b1a`](https://github.com/apache/spark/commit/8703b1a127235c49614d326334548f125b81383b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167747467
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48404/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167730354
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48399/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396598
  
    **[Test build #48349 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48349/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167033099
  
    **[Test build #48272 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48272/consoleFull)** for PR 10435 at commit [`f12cbb6`](https://github.com/apache/spark/commit/f12cbb6c4a9727eaaad8f7e8eca8d312ef67c5b8).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168614902
  
    **[Test build #48644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48644/consoleFull)** for PR 10435 at commit [`aa57583`](https://github.com/apache/spark/commit/aa575834877bfa8856aa4cea6f6149fe08cb18b5).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by cloud-fan <gi...@git.apache.org>.
GitHub user cloud-fan reopened a pull request:

    https://github.com/apache/spark/pull/10435

    [SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions

    The hash algorithm is based on https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/rows.scala#L130-L158
    Also use this expression in `Exchange`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark hash-expr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10435.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10435
    
----
commit c8b0ea65b147c898467c00d551455abae74eddf0
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-12-22T15:22:59Z

    add hash expression

commit 8a89287ed42aadde3d51d95c389ddb1e98c3ccf2
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-12-23T11:48:37Z

    address comments

commit 1cdb2bcc1ede58fdd9c1e98bff4b5544b8a6e74e
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-12-24T01:28:54Z

    fix hash code

commit 8703b1a127235c49614d326334548f125b81383b
Author: Wenchen Fan <we...@databricks.com>
Date:   2015-12-30T01:58:54Z

    add comment to explain the benefit of hash expression

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167932291
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48442/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48280175
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -161,3 +163,197 @@ case class Crc32(child: Expression) extends UnaryExpression with ImplicitCastInp
         })
       }
     }
    +
    +/**
    + * A function that calculates hash value for a group of expressions.
    + */
    --- End diff --
    
    oh, it is also good to mention the compatibility benefit of following Hive at here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166650019
  
    **[Test build #48203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48203/consoleFull)** for PR 10435 at commit [`53b0ec5`](https://github.com/apache/spark/commit/53b0ec57947130b56d374708b03cb706ab21c0ae).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `case class Hash(children: Seq[Expression]) extends Expression `\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167396643
  
    **[Test build #48350 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48350/consoleFull)** for PR 10435 at commit [`c130097`](https://github.com/apache/spark/commit/c130097e3f12a0d45eed2e2d5eb6682d65f11c9a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-166649675
  
    **[Test build #48203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48203/consoleFull)** for PR 10435 at commit [`53b0ec5`](https://github.com/apache/spark/commit/53b0ec57947130b56d374708b03cb706ab21c0ae).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10435#discussion_r48646861
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala ---
    @@ -278,6 +278,7 @@ object FunctionRegistry {
         // misc functions
         expression[Crc32]("crc32"),
         expression[Md5]("md5"),
    +    expression[Murmur3Hash]("murmur3_hash"),
    --- End diff --
    
    How many does it break?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167246248
  
    **[Test build #48325 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48325/consoleFull)** for PR 10435 at commit [`04a7301`](https://github.com/apache/spark/commit/04a730154283a6125f05ed984115adf2e455ac60).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167292150
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168877190
  
    I've merged this. You can address the API comment in the next pull request. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-167447980
  
    **[Test build #48356 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48356/consoleFull)** for PR 10435 at commit [`655800c`](https://github.com/apache/spark/commit/655800cd72e5fabc256cd64a82f6ccd60b491f92).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168674651
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48656/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12480][SQL] add Hash expression that ca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10435#issuecomment-168614909
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48644/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org