You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "LuciferYang (via GitHub)" <gi...@apache.org> on 2023/12/31 14:04:02 UTC

[PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

LuciferYang opened a new pull request, #44547:
URL: https://github.com/apache/spark/pull/44547

   ### What changes were proposed in this pull request?
   This pr aims to upgarde `datasketches-java` from 3.3.0 to 5.0.0
   
   ### Why are the changes needed?
   The new brings some improvements, like `Major speed performance improvements for HLL union/merge operations`, the full release notes as follows:
   - https://github.com/apache/datasketches-java/releases/tag/4.0.0
   - https://github.com/apache/datasketches-java/releases/tag/4.1.0
   - https://github.com/apache/datasketches-java/releases/tag/4.2.0
   - https://github.com/apache/datasketches-java/releases/tag/5.0.0
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

Posted by "leerho (via GitHub)" <gi...@apache.org>.
leerho commented on PR #44547:
URL: https://github.com/apache/spark/pull/44547#issuecomment-1874552923

   5.01 fixed two issues: 
   - PR 482: The Hll Union :: toString(), which prints out a simple diagnostic summary of the sketch, might change the internal state of the union.  This was not intended.  
   - PR 485: The KllItemsSketch<Boolean> was not serializing and deserializing the min and max values properly. It only affects this specific generic case of <Boolean>. 
   This is a rather bizarre use case for quantiles -- but nonetheless it is fixed!  :)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44547:
URL: https://github.com/apache/spark/pull/44547#discussion_r1438889601


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:
##########
@@ -315,7 +316,7 @@ case class HllUnionAgg(
             union.update(sketch)
             Some(union)
           } catch {
-            case _: java.lang.Error =>
+            case _: SketchesArgumentException | _: java.lang.Error =>

Review Comment:
   The addition of capturing `SketchesArgumentException` is to fix the test failure for `hll.sql`.
   
   ```
   diff --git a/sql/core/src/test/resources/sql-tests/results/hll.sql.out b/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   index adf6c40d1bd..8f4d14fcf8d 100644
   --- a/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   +++ b/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   @@ -220,14 +220,8 @@ SELECT hll_sketch_estimate(CAST ('abc' AS BINARY))
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_sketch_estimate`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
    
    
    -- !query
   @@ -235,14 +229,8 @@ SELECT hll_union(CAST ('abc' AS BINARY), CAST ('def' AS BINARY))
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_union`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
    
    
    -- !query
   @@ -251,14 +239,8 @@ FROM (SELECT CAST('abc' AS BINARY) AS buffer)
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_union_agg`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
   ```
   This is due to some changes in the implementation of the `HllUtil#checkPreamble` method, and the `Util#checkBounds` method will throw `SketchesArgumentException`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44547:
URL: https://github.com/apache/spark/pull/44547#issuecomment-1873570391

   @bjornjorgensen ok, let's wait 5.0.1. Thanks ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.1 [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #44547: [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.1
URL: https://github.com/apache/spark/pull/44547


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.1 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44547:
URL: https://github.com/apache/spark/pull/44547#issuecomment-1876226244

   Thanks @dongjoon-hyun @leerho and @bjornjorgensen ~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44547:
URL: https://github.com/apache/spark/pull/44547#discussion_r1438889601


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala:
##########
@@ -315,7 +316,7 @@ case class HllUnionAgg(
             union.update(sketch)
             Some(union)
           } catch {
-            case _: java.lang.Error =>
+            case _: SketchesArgumentException | _: java.lang.Error =>

Review Comment:
   The addition of capturing `SketchesArgumentException` is to fix the test failure for `hll.sql`.
   
   ```
   diff --git a/sql/core/src/test/resources/sql-tests/results/hll.sql.out b/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   index adf6c40d1bd..8f4d14fcf8d 100644
   --- a/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   +++ b/sql/core/src/test/resources/sql-tests/results/hll.sql.out
   @@ -220,14 +220,8 @@ SELECT hll_sketch_estimate(CAST ('abc' AS BINARY))
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_sketch_estimate`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
    
    
    -- !query
   @@ -235,14 +229,8 @@ SELECT hll_union(CAST ('abc' AS BINARY), CAST ('def' AS BINARY))
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_union`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
    
    
    -- !query
   @@ -251,14 +239,8 @@ FROM (SELECT CAST('abc' AS BINARY) AS buffer)
    -- !query schema
    struct<>
    -- !query output
   -org.apache.spark.SparkRuntimeException
   -{
   -  "errorClass" : "HLL_INVALID_INPUT_SKETCH_BUFFER",
   -  "sqlState" : "22546",
   -  "messageParameters" : {
   -    "function" : "`hll_union_agg`"
   -  }
   -}
   +org.apache.datasketches.common.SketchesArgumentException
   +Bounds Violation: reqOffset: 0, reqLength: 8, (reqOff + reqLen): 8, allocSize: 3
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 5.0.0 [spark]

Posted by "bjornjorgensen (via GitHub)" <gi...@apache.org>.
bjornjorgensen commented on PR #44547:
URL: https://github.com/apache/spark/pull/44547#issuecomment-1873341346

   should we wait for 5.0.1 https://github.com/apache/datasketches-java/issues/484 
   [it looks like the voting for it has passed.](https://lists.apache.org/thread/2odrgnt1wbsk6pkbfonr64v79lowpd85)  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org