You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by wzhfy <gi...@git.apache.org> on 2017/10/16 09:01:17 UTC
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
GitHub user wzhfy opened a pull request:
https://github.com/apache/spark/pull/19506
[SPARK-22285] [SQL] Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate
## What changes were proposed in this pull request?
The current implementation of `ApproxCountDistinctForIntervals` is `ImperativeAggregate`. The number of `aggBufferAttributes` is the number of total words in the hllppHelper array. Each hllppHelper has 52 words by default relativeSD.
Since this aggregate function is used in equi-height histogram generation, and the number of buckets in histogram is usually hundreds, the number of `aggBufferAttributes` can easily reach tens of thousands or even more.
This leads to a huge method in codegen and causes errors such as `org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB`. Besides, huge generated methods also result in performance regression.
In this PR, we change its implementation to `TypedImperativeAggregate`. After the fix, `ApproxCountDistinctForIntervals` can deal with more than thousands endpoints without throwing codegen error, and improve performance from `20 sec` to `2 sec` in a test case of 500 endpoints.
## How was this patch tested?
Test by an added test case and existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wzhfy/spark change_forIntervals_typedAgg
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19506.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19506
----
commit f66239469f2be030f23d86ff8686b59e99033b6c
Author: Zhenhua Wang <wa...@huawei.com>
Date: 2017-10-13T07:04:22Z
implement ApproxCountDistinctForIntervals as TypedImperativeAggregate
commit 1c3e18ada5870b5d2e2a3ecb54ae2312f27af09c
Author: Zhenhua Wang <wa...@huawei.com>
Date: 2017-10-14T03:34:44Z
remove offset
commit 792b58a2a473a899025311d37afc788db087c68e
Author: Zhenhua Wang <wa...@huawei.com>
Date: 2017-10-16T05:31:38Z
fix withOffset return type
commit 652b301bc032be6fe66664526f3c1c316eb981b6
Author: Zhenhua Wang <wa...@huawei.com>
Date: 2017-10-16T08:27:38Z
add test for large number of endpoints
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r146151411
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val byteArray = new Array[Byte](obj.length * 8)
+ obj.indices.foreach { i =>
--- End diff --
Fixed. Thanks for the reminder!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82798/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r145739279
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -46,16 +49,7 @@ case class ApproxCountDistinctForIntervals(
relativeSD: Double = 0.05,
mutableAggBufferOffset: Int = 0,
inputAggBufferOffset: Int = 0)
--- End diff --
these 2 offsets are not needed anymore.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82947/testReport)** for PR 19506 at commit [`1b75428`](https://github.com/apache/spark/commit/1b754284ddba8ebe0e551580b55a8b907607c30a).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82948/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19506
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r146124196
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val byteArray = new Array[Byte](obj.length * 8)
+ obj.indices.foreach { i =>
+ Platform.putLong(byteArray, Platform.BYTE_ARRAY_OFFSET + i * 8, obj(i))
+ }
+ byteArray
+ }
+
+ override def deserialize(bytes: Array[Byte]): Array[Long] = {
+ val length = bytes.length / 8
--- End diff --
add `assert(bytes.length % 8 == 0)`
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82796/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82947/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19506
LGTM except a few minor comments
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82796/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82798/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/19506
thanks, merging to master!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r146124184
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val byteArray = new Array[Byte](obj.length * 8)
+ obj.indices.foreach { i =>
+ Platform.putLong(byteArray, Platform.BYTE_ARRAY_OFFSET + i * 8, obj(i))
+ }
+ byteArray
+ }
+
+ override def deserialize(bytes: Array[Byte]): Array[Long] = {
+ val length = bytes.length / 8
+ val longArray = new Array[Long](length)
+ (0 until length).foreach { i =>
--- End diff --
ditto
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82796 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82796/testReport)** for PR 19506 at commit [`652b301`](https://github.com/apache/spark/commit/652b301bc032be6fe66664526f3c1c316eb981b6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
* `class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext `
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r145741014
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +221,23 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val buffer = ByteBuffer.wrap(new Array(obj.length * Longs.BYTES))
--- End diff --
IIRC `ByteBuffer` is pretty slow for writing, shall we use unsafe writing?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r146124168
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +219,26 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val byteArray = new Array[Byte](obj.length * 8)
+ obj.indices.foreach { i =>
--- End diff --
use while look here for better performance in Scala, as this is a performance sensitive code path.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82929/testReport)** for PR 19506 at commit [`2d1b070`](https://github.com/apache/spark/commit/2d1b070e67625db211267fbb2b4d4c25e4da70a5).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82929/testReport)** for PR 19506 at commit [`2d1b070`](https://github.com/apache/spark/commit/2d1b070e67625db211267fbb2b4d4c25e4da70a5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82798/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82966/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Merged build finished. Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19506
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82929/
Test PASSed.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82966 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82966/testReport)** for PR 19506 at commit [`1e95a2f`](https://github.com/apache/spark/commit/1e95a2f6b7c934893b05fb396916568e8a73d523).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/19506
test this please
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r145902112
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -153,13 +129,14 @@ case class ApproxCountDistinctForIntervals(
// endpoints are sorted into ascending order already
if (endpoints.head > doubleValue || endpoints.last < doubleValue) {
// ignore if the value is out of the whole range
- return
+ return buffer
}
val hllppIndex = findHllppIndex(doubleValue)
- val offset = mutableAggBufferOffset + hllppIndex * numWordsPerHllpp
- hllppArray(hllppIndex).update(buffer, offset, value, child.dataType)
+ val offset = hllppIndex * numWordsPerHllpp
+ hllppArray(hllppIndex).update(LongArrayInput(buffer), offset, value, child.dataType)
--- End diff --
InternalRow(buffer) will copy the buffer.
Creating a LongArrayInternalRow is a good idea, thanks!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r145741653
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -153,13 +129,14 @@ case class ApproxCountDistinctForIntervals(
// endpoints are sorted into ascending order already
if (endpoints.head > doubleValue || endpoints.last < doubleValue) {
// ignore if the value is out of the whole range
- return
+ return buffer
}
val hllppIndex = findHllppIndex(doubleValue)
- val offset = mutableAggBufferOffset + hllppIndex * numWordsPerHllpp
- hllppArray(hllppIndex).update(buffer, offset, value, child.dataType)
+ val offset = hllppIndex * numWordsPerHllpp
+ hllppArray(hllppIndex).update(LongArrayInput(buffer), offset, value, child.dataType)
--- End diff --
you can just pass `InternalRow(buffer)` here, to save a lot of code changes. If performance matters here, you can create a `LongArrayInternalRow` to avoid boxing.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #19506: [SPARK-22285] [SQL] Change implementation of Appr...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on a diff in the pull request:
https://github.com/apache/spark/pull/19506#discussion_r146100286
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -239,4 +221,23 @@ case class ApproxCountDistinctForIntervals(
override def dataType: DataType = ArrayType(LongType)
override def prettyName: String = "approx_count_distinct_for_intervals"
+
+ override def serialize(obj: Array[Long]): Array[Byte] = {
+ val buffer = ByteBuffer.wrap(new Array(obj.length * Longs.BYTES))
--- End diff --
Changed to unsafe writing, could you take another look?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82947 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82947/testReport)** for PR 19506 at commit [`1b75428`](https://github.com/apache/spark/commit/1b754284ddba8ebe0e551580b55a8b907607c30a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82948 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82948/testReport)** for PR 19506 at commit [`49b4ac2`](https://github.com/apache/spark/commit/49b4ac2ae6bd53dd529a5f1f1ea55fe5fd987192).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/19506
**[Test build #82948 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82948/testReport)** for PR 19506 at commit [`49b4ac2`](https://github.com/apache/spark/commit/49b4ac2ae6bd53dd529a5f1f1ea55fe5fd987192).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #19506: [SPARK-22285] [SQL] Change implementation of ApproxCount...
Posted by wzhfy <gi...@git.apache.org>.
Github user wzhfy commented on the issue:
https://github.com/apache/spark/pull/19506
cc @cloud-fan
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org