You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "huaxingao (via GitHub)" <gi...@apache.org> on 2024/03/19 21:36:46 UTC

[PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

huaxingao opened a new pull request, #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216

   ## Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   ## Rationale for this change
   
   This PR adds the support for `covar_samp` and `covar_pop`
   
   ## What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ## How are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   new tests
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532930632


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Shall we map to `DecimalType(20, 0)` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1534354677


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Maybe we should enable partial + final Comet aggregation as a whole, See #223.



##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Maybe we should enable partial + final Comet aggregation as a whole, i.e., #223.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532921316


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   If we have both partial and final aggregation operators in Comet, it should be okay as Java doesn't process the intermediate results (state), but if we have only partial aggregation in Comet, this Uint64 array as LongType will possibly cause overflow in corner case, I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532394706


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Shall we map to `DecimalType` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1533053736


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Hmm, how does it work? You will treat UInt64 array as decimal array?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532817067


##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   I think it is `UInt64`? But what you add is `UInt8Vector`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#issuecomment-2021658430

   Seems we can't use covariance from DataFusion, because DataFusion has UInt64 for state_fields count, but Spark has Double for count. I will close this PR and implement Comet's own covariance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532396975


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Where do you use it? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1531394725


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Hmm, is this UInt64? Using `LongType` to represent it will overflow, I think.



##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   Why we need to handle `UInt8Vector`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1531395957


##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   Is it used in `Covariance`/`CovariancePop` state types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1533068315


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Since the max of UInt64 is 18446744073709551615, I guess we can use `DecimalType(20, 0)` to represent the number without overflowing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532395943


##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   I think it's for state field [`count`](https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/aggregate/covariance.rs#L167)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532406374


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   This `UInt64` is for state field `count`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532918230


##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   Oh...the naming of Java Arrow API...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1533089676


##########
common/src/main/scala/org/apache/spark/sql/comet/util/Utils.scala:
##########
@@ -69,6 +69,7 @@ object Utils {
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 2 => ShortType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 4 => IntegerType
     case int: ArrowType.Int if int.getIsSigned && int.getBitWidth == 8 * 8 => LongType
+    case int: ArrowType.Int if int.getBitWidth == 8 * 8 => LongType

Review Comment:
   Yes, but I mean Spark will treat the actual UInt64 array as decimal one. For example, it will call `getDecimal` on an UInt64 array, does it work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao commented on code in PR #216:
URL: https://github.com/apache/arrow-datafusion-comet/pull/216#discussion_r1532882176


##########
common/src/main/scala/org/apache/comet/vector/NativeUtil.scala:
##########
@@ -205,7 +205,7 @@ class NativeUtil {
       case v @ (_: BitVector | _: TinyIntVector | _: SmallIntVector | _: IntVector |
           _: BigIntVector | _: Float4Vector | _: Float8Vector | _: VarCharVector |
           _: DecimalVector | _: DateDayVector | _: TimeStampMicroTZVector | _: VarBinaryVector |
-          _: FixedSizeBinaryVector | _: TimeStampMicroVector) =>
+          _: FixedSizeBinaryVector | _: TimeStampMicroVector | _: UInt8Vector) =>

Review Comment:
   Seems `UInt8Vector` is for `UInt64`. The name is confusing. https://github.com/apache/arrow/blob/main/java/vector/src/main/java/org/apache/arrow/vector/UInt8Vector.java#L37



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: Support covar_samp and covar_pop [arrow-datafusion-comet]

Posted by "huaxingao (via GitHub)" <gi...@apache.org>.
huaxingao closed pull request #216: feat: Support covar_samp and covar_pop
URL: https://github.com/apache/arrow-datafusion-comet/pull/216


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org