You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by cloud-fan <gi...@git.apache.org> on 2016/01/11 10:20:48 UTC

[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/10694

    [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row

    https://issues.apache.org/jira/browse/SPARK-12642

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark hash-expr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10694
    
----
commit ae0349289209f1d53d5e4fef7c34b2a50c2b8e14
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-01-11T09:17:49Z

    improve the hash expression to be decoupled from unsafe row

commit ec1247f72cec0504139564cec13536285a6126ea
Author: Wenchen Fan <we...@databricks.com>
Date:   2016-01-11T09:20:21Z

    Merge remote-tracking branch 'origin/master' into hash-expr

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171108085
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170694166
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49644273
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +232,225 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => Murmur3_x86_32.hashInt(c.months, hashLong(c.microseconds))
    +      case a: Array[Byte] =>
    +        Murmur3_x86_32.hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
    +      case s: UTF8String =>
    +        Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
    +
    +      case array: ArrayData =>
    +        val elementType = dataType match {
    +          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
    +          case ArrayType(et, _) => et
    +        }
    +        var result = seed
    +        var i = 0
    +        while (i < array.numElements()) {
    +          result = computeHash(array.get(i, elementType), elementType, result)
    +          i += 1
    +        }
    +        result
    +
    +      case map: MapData =>
    +        val (kt, vt) = dataType match {
    +          case udt: UserDefinedType[_] =>
    +            val mapType = udt.sqlType.asInstanceOf[MapType]
    +            mapType.keyType -> mapType.valueType
    +          case MapType(kt, vt, _) => kt -> vt
    +        }
    +        val keys = map.keyArray()
    +        val values = map.valueArray()
    +        var result = seed
    +        var i = 0
    +        while (i < map.numElements()) {
    +          result = computeHash(keys.get(i, kt), kt, result)
    +          result = computeHash(values.get(i, vt), vt, result)
    +          i += 1
    +        }
    +        result
    +
    +      case struct: InternalRow =>
    --- End diff --
    
    Actually it's possible to abstract it and reduce duplicated code, I'll do it in follow-up PR


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170487715
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170489792
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49129/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170688753
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49162/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170486837
  
    **[Test build #49129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49129/consoleFull)** for PR 10694 at commit [`d3ff3a3`](https://github.com/apache/spark/commit/d3ff3a3bea8b4fa48eb82182965e01ac4ed2fd5f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49402971
  
    --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java ---
    @@ -51,16 +55,38 @@ public int hashUnsafeWords(Object base, long offset, int lengthInBytes) {
       public static int hashUnsafeWords(Object base, long offset, int lengthInBytes, int seed) {
         // This is based on Guava's `Murmur32_Hasher.processRemaining(ByteBuffer)` method.
         assert (lengthInBytes % 8 == 0): "lengthInBytes must be a multiple of 8 (word-aligned)";
    +    int h1 = hashBytesByInt(base, offset, lengthInBytes, seed);
    +    return fmix(h1, lengthInBytes);
    +  }
    +
    +  public static int hashUnsafeBytes(Object base, long offset, int lengthInBytes, int seed) {
    --- End diff --
    
    what's the purpose of this one vs the other ones? Can we unify? I can't imagine someone caring if we hash byte by byte vs hashing the initial bytes in large chunks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170694168
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49169/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171074701
  
    **[Test build #49264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49264/consoleFull)** for PR 10694 at commit [`1d7d054`](https://github.com/apache/spark/commit/1d7d0542553eb1d98362bf050bf5bdd0a2cc72de).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49419833
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +230,230 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => hashInt(c.months) ^ hashLong(c.microseconds)
    +      case a: Array[Byte] =>
    +        Murmur3_x86_32.hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
    +      case s: UTF8String =>
    +        Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
    +
    +      case array: ArrayData =>
    +        val elementType = dataType match {
    +          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
    +          case ArrayType(et, _) => et
    +        }
    +        var result = 0
    +        var i = 0
    +        while (i < array.numElements()) {
    +          val hashValue = computeHash(array.get(i, elementType), elementType, seed)
    --- End diff --
    
    I think this is better as
    
    var result = seed
    while () 
      result = computeHash(..., result)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49643209
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +232,225 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => Murmur3_x86_32.hashInt(c.months, hashLong(c.microseconds))
    +      case a: Array[Byte] =>
    +        Murmur3_x86_32.hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
    +      case s: UTF8String =>
    +        Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
    +
    +      case array: ArrayData =>
    +        val elementType = dataType match {
    +          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
    +          case ArrayType(et, _) => et
    +        }
    +        var result = seed
    +        var i = 0
    +        while (i < array.numElements()) {
    +          result = computeHash(array.get(i, elementType), elementType, result)
    +          i += 1
    +        }
    +        result
    +
    +      case map: MapData =>
    +        val (kt, vt) = dataType match {
    +          case udt: UserDefinedType[_] =>
    +            val mapType = udt.sqlType.asInstanceOf[MapType]
    +            mapType.keyType -> mapType.valueType
    +          case MapType(kt, vt, _) => kt -> vt
    +        }
    +        val keys = map.keyArray()
    +        val values = map.valueArray()
    +        var result = seed
    +        var i = 0
    +        while (i < map.numElements()) {
    +          result = computeHash(keys.get(i, kt), kt, result)
    +          result = computeHash(values.get(i, vt), vt, result)
    +          i += 1
    +        }
    +        result
    +
    +      case struct: InternalRow =>
    --- End diff --
    
    why is this different than just calling eval()


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49403219
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +230,230 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => hashInt(c.months) ^ hashLong(c.microseconds)
    --- End diff --
    
    I don't think you should mix it this way. This should call into murmur twice:
    hashInt(c.months, hashInt(c.microseconds, seed))


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170688430
  
    **[Test build #49162 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49162/consoleFull)** for PR 10694 at commit [`74205dc`](https://github.com/apache/spark/commit/74205dc86566add5dd3d38e0ee62612086f7d197).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171108086
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49264/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49406007
  
    --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/hash/Murmur3_x86_32.java ---
    @@ -51,16 +55,38 @@ public int hashUnsafeWords(Object base, long offset, int lengthInBytes) {
       public static int hashUnsafeWords(Object base, long offset, int lengthInBytes, int seed) {
         // This is based on Guava's `Murmur32_Hasher.processRemaining(ByteBuffer)` method.
         assert (lengthInBytes % 8 == 0): "lengthInBytes must be a multiple of 8 (word-aligned)";
    +    int h1 = hashBytesByInt(base, offset, lengthInBytes, seed);
    +    return fmix(h1, lengthInBytes);
    +  }
    +
    +  public static int hashUnsafeBytes(Object base, long offset, int lengthInBytes, int seed) {
    --- End diff --
    
    `hashUnsafeWords` requires the bytes word-aligned, and this one breaks this limitation. It's useful to hash binary type and string type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170489781
  
    **[Test build #49129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49129/consoleFull)** for PR 10694 at commit [`d3ff3a3`](https://github.com/apache/spark/commit/d3ff3a3bea8b4fa48eb82182965e01ac4ed2fd5f).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171417975
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171107849
  
    **[Test build #49264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49264/consoleFull)** for PR 10694 at commit [`1d7d054`](https://github.com/apache/spark/commit/1d7d0542553eb1d98362bf050bf5bdd0a2cc72de).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49419795
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +230,230 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => hashInt(c.months) ^ hashLong(c.microseconds)
    --- End diff --
    
    My general thinking is to pick a hash function that is known to be good, e.g. murmur3 and make sure what I do  when I use it doesn't have problems: I don't convert NULLs to a common value (e.g. null map to 0 would be bad) or introduce logic if I don't have to. I *think* what you have is probably okay but something similar like hashInt(c.months) ^ hashInt(c.day) would compute 0 whenever months == day. I try not to reason about this by delegating to the carefully crafted hash functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170481566
  
    cc @nongli @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170688749
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170650639
  
    **[Test build #49162 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49162/consoleFull)** for PR 10694 at commit [`74205dc`](https://github.com/apache/spark/commit/74205dc86566add5dd3d38e0ee62612086f7d197).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10694


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49419855
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +230,230 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => hashInt(c.months) ^ hashLong(c.microseconds)
    +      case a: Array[Byte] =>
    +        Murmur3_x86_32.hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
    +      case s: UTF8String =>
    +        Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
    +
    +      case array: ArrayData =>
    +        val elementType = dataType match {
    +          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
    +          case ArrayType(et, _) => et
    +        }
    +        var result = 0
    +        var i = 0
    +        while (i < array.numElements()) {
    +          val hashValue = computeHash(array.get(i, elementType), elementType, seed)
    +          result = result * 31 + hashValue
    +          i += 1
    +        }
    +        result
    +
    +      case map: MapData =>
    +        val (kt, vt) = dataType match {
    +          case udt: UserDefinedType[_] =>
    +            val mapType = udt.sqlType.asInstanceOf[MapType]
    +            mapType.keyType -> mapType.valueType
    +          case MapType(kt, vt, _) => kt -> vt
    +        }
    +        val keys = map.keyArray()
    +        val values = map.valueArray()
    +        var result = 0
    +        var i = 0
    +        while (i < map.numElements()) {
    +          val keyHash = computeHash(keys.get(i, kt), kt, seed)
    +          val valueHash = computeHash(values.get(i, vt), vt, seed)
    --- End diff --
    
    Same here. Think of the hash more as the intermediate value and murmur will mix it up for you.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49406251
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +230,230 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => hashInt(c.months) ^ hashLong(c.microseconds)
    --- End diff --
    
    I haven't got the intuition to measure a hash algorithm, could you provide more information? And it'll be great if you can also take a look at array type and map type, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170489789
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10694#discussion_r49644179
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala ---
    @@ -206,22 +232,225 @@ case class Murmur3Hash(children: Seq[Expression], seed: Int) extends Expression
         }
       }
     
    -  private lazy val unsafeProjection = UnsafeProjection.create(children)
    +  override def prettyName: String = "hash"
    +
    +  override def sql: String = s"$prettyName(${children.map(_.sql).mkString(", ")}, $seed)"
     
       override def eval(input: InternalRow): Any = {
    -    unsafeProjection(input).hashCode(seed)
    +    var hash = seed
    +    var i = 0
    +    val len = children.length
    +    while (i < len) {
    +      hash = computeHash(children(i).eval(input), children(i).dataType, hash)
    +      i += 1
    +    }
    +    hash
       }
     
    +  private def computeHash(value: Any, dataType: DataType, seed: Int): Int = {
    +    def hashInt(i: Int): Int = Murmur3_x86_32.hashInt(i, seed)
    +    def hashLong(l: Long): Int = Murmur3_x86_32.hashLong(l, seed)
    +
    +    value match {
    +      case null => seed
    +      case b: Boolean => hashInt(if (b) 1 else 0)
    +      case b: Byte => hashInt(b)
    +      case s: Short => hashInt(s)
    +      case i: Int => hashInt(i)
    +      case l: Long => hashLong(l)
    +      case f: Float => hashInt(java.lang.Float.floatToIntBits(f))
    +      case d: Double => hashLong(java.lang.Double.doubleToLongBits(d))
    +      case d: Decimal =>
    +        val precision = dataType.asInstanceOf[DecimalType].precision
    +        if (precision <= Decimal.MAX_LONG_DIGITS) {
    +          hashLong(d.toUnscaledLong)
    +        } else {
    +          val bytes = d.toJavaBigDecimal.unscaledValue().toByteArray
    +          Murmur3_x86_32.hashUnsafeBytes(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length, seed)
    +        }
    +      case c: CalendarInterval => Murmur3_x86_32.hashInt(c.months, hashLong(c.microseconds))
    +      case a: Array[Byte] =>
    +        Murmur3_x86_32.hashUnsafeBytes(a, Platform.BYTE_ARRAY_OFFSET, a.length, seed)
    +      case s: UTF8String =>
    +        Murmur3_x86_32.hashUnsafeBytes(s.getBaseObject, s.getBaseOffset, s.numBytes(), seed)
    +
    +      case array: ArrayData =>
    +        val elementType = dataType match {
    +          case udt: UserDefinedType[_] => udt.sqlType.asInstanceOf[ArrayType].elementType
    +          case ArrayType(et, _) => et
    +        }
    +        var result = seed
    +        var i = 0
    +        while (i < array.numElements()) {
    +          result = computeHash(array.get(i, elementType), elementType, result)
    +          i += 1
    +        }
    +        result
    +
    +      case map: MapData =>
    +        val (kt, vt) = dataType match {
    +          case udt: UserDefinedType[_] =>
    +            val mapType = udt.sqlType.asInstanceOf[MapType]
    +            mapType.keyType -> mapType.valueType
    +          case MapType(kt, vt, _) => kt -> vt
    +        }
    +        val keys = map.keyArray()
    +        val values = map.valueArray()
    +        var result = seed
    +        var i = 0
    +        while (i < map.numElements()) {
    +          result = computeHash(keys.get(i, kt), kt, result)
    +          result = computeHash(values.get(i, vt), vt, result)
    +          i += 1
    +        }
    +        result
    +
    +      case struct: InternalRow =>
    --- End diff --
    
    This maybe a nested struct and is different from `input` needed by this hash expression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-171422787
  
    I'm going to merge this. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12642][SQL] improve the hash expression...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10694#issuecomment-170487719
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49127/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org