You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/16 09:01:17 UTC

[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19506

    [SPARK-22285] [SQL] Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate

    ## What changes were proposed in this pull request?
    
    The current implementation of `ApproxCountDistinctForIntervals` is `ImperativeAggregate`. The number of `aggBufferAttributes` is the number of total words in the hllppHelper array. Each hllppHelper has 52 words by default relativeSD.
    
    Since this aggregate function is used in equi-height histogram generation, and the number of buckets in histogram is usually hundreds, the number of `aggBufferAttributes` can easily reach tens of thousands or even more.
    
    This leads to a huge method in codegen and causes errors such as `org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB`. Besides, huge generated methods also result in performance regression.
    
    In this PR, we change its implementation to `TypedImperativeAggregate`. After the fix, `ApproxCountDistinctForIntervals` can deal with more than thousands endpoints without throwing codegen error, and improve performance from `20 sec` to `2 sec` in a test case of 500 endpoints.
    
    ## How was this patch tested?
    
    Test by an added test case and existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark change_forIntervals_typedAgg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19506.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19506
    
----
commit f66239469f2be030f23d86ff8686b59e99033b6c
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-13T07:04:22Z

    implement ApproxCountDistinctForIntervals as TypedImperativeAggregate

commit 1c3e18ada5870b5d2e2a3ecb54ae2312f27af09c
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-14T03:34:44Z

    remove offset

commit 792b58a2a473a899025311d37afc788db087c68e
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-16T05:31:38Z

    fix withOffset return type

commit 652b301bc032be6fe66664526f3c1c316eb981b6
Author: Zhenhua Wang <wa...@huawei.com>
Date:   2017-10-16T08:27:38Z

    add test for large number of endpoints

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r146151411
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val byteArray = new Array[Byte](obj.length * 8)
    +    obj.indices.foreach { i =>
    --- End diff --
    
    Fixed. Thanks for the reminder!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82798/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r145739279
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -46,16 +49,7 @@ case class ApproxCountDistinctForIntervals(
         relativeSD: Double = 0.05,
         mutableAggBufferOffset: Int = 0,
         inputAggBufferOffset: Int = 0)
    --- End diff --
    
    these 2 offsets are not needed anymore.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82947/testReport)** for PR 19506 at commit [`1b75428`](https://github.com/apache/spark/commit/1b754284ddba8ebe0e551580b55a8b907607c30a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82948/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19506


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r146124196
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val byteArray = new Array[Byte](obj.length * 8)
    +    obj.indices.foreach { i =>
    +      Platform.putLong(byteArray, Platform.BYTE_ARRAY_OFFSET + i * 8, obj(i))
    +    }
    +    byteArray
    +  }
    +
    +  override def deserialize(bytes: Array[Byte]): Array[Long] = {
    +    val length = bytes.length / 8
    --- End diff --
    
    add `assert(bytes.length % 8 == 0)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82796/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82947/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    LGTM except a few minor comments


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82796/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82798/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    thanks, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r146124184
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val byteArray = new Array[Byte](obj.length * 8)
    +    obj.indices.foreach { i =>
    +      Platform.putLong(byteArray, Platform.BYTE_ARRAY_OFFSET + i * 8, obj(i))
    +    }
    +    byteArray
    +  }
    +
    +  override def deserialize(bytes: Array[Byte]): Array[Long] = {
    +    val length = bytes.length / 8
    +    val longArray = new Array[Long](length)
    +    (0 until length).foreach { i =>
    --- End diff --
    
    ditto


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82796 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82796/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext `


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r145741014
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +221,23 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val buffer = ByteBuffer.wrap(new Array(obj.length * Longs.BYTES))
    --- End diff --
    
    IIRC `ByteBuffer` is pretty slow for writing, shall we use unsafe writing?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r146124168
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val byteArray = new Array[Byte](obj.length * 8)
    +    obj.indices.foreach { i =>
    --- End diff --
    
    use while look here for better performance in Scala, as this is a performance sensitive code path.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82929/testReport)** for PR 19506 at commit [`2d1b070`](https://github.com/apache/spark/commit/2d1b070e67625db211267fbb2b4d4c25e4da70a5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82929/testReport)** for PR 19506 at commit [`2d1b070`](https://github.com/apache/spark/commit/2d1b070e67625db211267fbb2b4d4c25e4da70a5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82798/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82966/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82929/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    test this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r145902112
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -153,13 +129,14 @@ case class ApproxCountDistinctForIntervals(
           // endpoints are sorted into ascending order already
           if (endpoints.head > doubleValue || endpoints.last < doubleValue) {
             // ignore if the value is out of the whole range
    -        return
    +        return buffer
           }
     
           val hllppIndex = findHllppIndex(doubleValue)
    -      val offset = mutableAggBufferOffset + hllppIndex * numWordsPerHllpp
    -      hllppArray(hllppIndex).update(buffer, offset, value, child.dataType)
    +      val offset = hllppIndex * numWordsPerHllpp
    +      hllppArray(hllppIndex).update(LongArrayInput(buffer), offset, value, child.dataType)
    --- End diff --
    
    InternalRow(buffer) will copy the buffer.
    Creating a LongArrayInternalRow is a good idea, thanks!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r145741653
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -153,13 +129,14 @@ case class ApproxCountDistinctForIntervals(
           // endpoints are sorted into ascending order already
           if (endpoints.head > doubleValue || endpoints.last < doubleValue) {
             // ignore if the value is out of the whole range
    -        return
    +        return buffer
           }
     
           val hllppIndex = findHllppIndex(doubleValue)
    -      val offset = mutableAggBufferOffset + hllppIndex * numWordsPerHllpp
    -      hllppArray(hllppIndex).update(buffer, offset, value, child.dataType)
    +      val offset = hllppIndex * numWordsPerHllpp
    +      hllppArray(hllppIndex).update(LongArrayInput(buffer), offset, value, child.dataType)
    --- End diff --
    
    you can just pass `InternalRow(buffer)` here, to save a lot of code changes. If performance matters here, you can create a `LongArrayInternalRow` to avoid boxing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19506#discussion_r146100286
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
    @@ -239,4 +221,23 @@ case class ApproxCountDistinctForIntervals(
       override def dataType: DataType = ArrayType(LongType)
     
       override def prettyName: String = "approx_count_distinct_for_intervals"
    +
    +  override def serialize(obj: Array[Long]): Array[Byte] = {
    +    val buffer = ByteBuffer.wrap(new Array(obj.length * Longs.BYTES))
    --- End diff --
    
    Changed to unsafe writing, could you take another look?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82947/testReport)** for PR 19506 at commit [`1b75428`](https://github.com/apache/spark/commit/1b754284ddba8ebe0e551580b55a8b907607c30a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82948 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82948/testReport)** for PR 19506 at commit [`49b4ac2`](https://github.com/apache/spark/commit/49b4ac2ae6bd53dd529a5f1f1ea55fe5fd987192).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    **[Test build #82948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82948/testReport)** for PR 19506 at commit [`49b4ac2`](https://github.com/apache/spark/commit/49b4ac2ae6bd53dd529a5f1f1ea55fe5fd987192).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...

Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:

    https://github.com/apache/spark/pull/19506
  
    cc @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org