You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/05 20:13:09 UTC

[GitHub] [arrow] bkietz opened a new pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

bkietz opened a new pull request #10253:
URL: https://github.com/apache/arrow/pull/10253


   This allows us to take advantage of a partition expression like `f32 >= 1 and f32 <= 2 and is_valid(f32)`, which could be generated for a row group with no null values.
   
   Note this implies that `f32 >= 1 and f32 <= 2` is a guarantee that "f32 is between 1 and 2, or is null". This is probably fine vs the more pedantic `(f32 >= 1 and f32 <= 2) or is_null(f32)`, but requires a potentially problematic addendum to the contract of guarantees: guarantees do not apply to null arguments unless an explicit is_null or is_valid is included


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#discussion_r627532732



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -198,17 +198,22 @@ static util::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
   auto maybe_min = min->CastTo(field->type());
   auto maybe_max = max->CastTo(field->type());
   if (maybe_min.ok() && maybe_max.ok()) {
-    auto col_min = maybe_min.MoveValueUnsafe();
-    auto col_max = maybe_max.MoveValueUnsafe();
-    if (col_min->Equals(col_max)) {
-      return compute::equal(std::move(field_expr), compute::literal(std::move(col_min)));
+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+    if (min->Equals(max)) {
+      return compute::equal(std::move(field_expr), compute::literal(std::move(min)));

Review comment:
       Do we want to add an `is_valid` here too if applicable?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#issuecomment-833477688


   Ah. I guess we should document the semantics we expect - right now it seems we're mostly translating implicitly what people expect from Parquet/Hive which is fine, but imprecise.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#discussion_r627552774



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -198,17 +198,22 @@ static util::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
   auto maybe_min = min->CastTo(field->type());
   auto maybe_max = max->CastTo(field->type());
   if (maybe_min.ok() && maybe_max.ok()) {
-    auto col_min = maybe_min.MoveValueUnsafe();
-    auto col_max = maybe_max.MoveValueUnsafe();
-    if (col_min->Equals(col_max)) {
-      return compute::equal(std::move(field_expr), compute::literal(std::move(col_min)));
+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+    if (min->Equals(max)) {
+      return compute::equal(std::move(field_expr), compute::literal(std::move(min)));

Review comment:
       Sounds good to me. I think a row group with `[2, null]` right now might actually get filtered out accidentally if your filter is `i32 != 2`, since the guarantee will simply be `i32 == 2`…




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#discussion_r627547092



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -198,17 +198,22 @@ static util::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
   auto maybe_min = min->CastTo(field->type());
   auto maybe_max = max->CastTo(field->type());
   if (maybe_min.ok() && maybe_max.ok()) {
-    auto col_min = maybe_min.MoveValueUnsafe();
-    auto col_max = maybe_max.MoveValueUnsafe();
-    if (col_min->Equals(col_max)) {
-      return compute::equal(std::move(field_expr), compute::literal(std::move(col_min)));
+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+    if (min->Equals(max)) {
+      return compute::equal(std::move(field_expr), compute::literal(std::move(min)));

Review comment:
       Equality expressions will be picked up by the ExtractKnownFieldValues pass, which assumes the field will `min` and never null (questionably correct)
   
   Longer term: I think we need to be more rigorous with the guarantee contract. Specifically: any guarantee should be usable as a filter, and in that case will evaluate to an array containing only `true` (and no nulls).
   
   | Data, Guarantee | Filter | Evaluation of guarantee on Data | |
   |---|---|---|---|
   | [2, 2], i32 == 2 | i32 == 2 | [true, true] | can simplify to `true` |
   | [2, null], i32 == 2 or is_null(i32) | i32 == 2 | [true, null] | the second row should be excluded from the filter |
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#issuecomment-833005335


   >  I think assuming nullability by default is fine
   
   where this gets semantically sticky is: we use `i32 == 64` as a guarantee that it's always 64 and never null


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#issuecomment-881343067


   Closing for now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #10253:
URL: https://github.com/apache/arrow/pull/10253


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#issuecomment-832976336


   https://issues.apache.org/jira/browse/ARROW-12659


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#discussion_r631235305



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -198,16 +198,27 @@ static util::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
   auto maybe_min = min->CastTo(field->type());
   auto maybe_max = max->CastTo(field->type());
   if (maybe_min.ok() && maybe_max.ok()) {
-    auto col_min = maybe_min.MoveValueUnsafe();
-    auto col_max = maybe_max.MoveValueUnsafe();
-    if (col_min->Equals(col_max)) {
-      return compute::equal(std::move(field_expr), compute::literal(std::move(col_min)));
+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+
+    compute::Expression range;
+    if (min->Equals(max)) {
+      auto single_value = compute::equal(field_expr, compute::literal(std::move(min)));
+
+      if (statistics->null_count() == 0) {
+        return single_value;
+      }
+      return compute::or_(std::move(single_value), is_null(std::move(field_expr)));
     }
 
     auto lower_bound =
-        compute::greater_equal(field_expr, compute::literal(std::move(col_min)));
-    auto upper_bound =
-        compute::less_equal(std::move(field_expr), compute::literal(std::move(col_max)));
+        compute::greater_equal(field_expr, compute::literal(std::move(min)));
+    auto upper_bound = compute::less_equal(field_expr, compute::literal(std::move(max)));
+
+    if (statistics->null_count() == 0) {

Review comment:
       whoops

##########
File path: cpp/src/arrow/compute/exec/expression_test.cc
##########
@@ -64,6 +64,12 @@ Expression cast(Expression argument, std::shared_ptr<DataType> to_type) {
               compute::CastOptions::Safe(std::move(to_type)));
 }
 
+Expression invert(Expression argument) { return call("invert", {std::move(argument)}); }

Review comment:
       yes, forgot about that factory




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10253: ARROW-12659: [C++][Compute] Support is_valid as a guarantee

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10253:
URL: https://github.com/apache/arrow/pull/10253#discussion_r629645983



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -198,16 +198,27 @@ static util::optional<compute::Expression> ColumnChunkStatisticsAsExpression(
   auto maybe_min = min->CastTo(field->type());
   auto maybe_max = max->CastTo(field->type());
   if (maybe_min.ok() && maybe_max.ok()) {
-    auto col_min = maybe_min.MoveValueUnsafe();
-    auto col_max = maybe_max.MoveValueUnsafe();
-    if (col_min->Equals(col_max)) {
-      return compute::equal(std::move(field_expr), compute::literal(std::move(col_min)));
+    min = maybe_min.MoveValueUnsafe();
+    max = maybe_max.MoveValueUnsafe();
+
+    compute::Expression range;
+    if (min->Equals(max)) {
+      auto single_value = compute::equal(field_expr, compute::literal(std::move(min)));
+
+      if (statistics->null_count() == 0) {
+        return single_value;
+      }
+      return compute::or_(std::move(single_value), is_null(std::move(field_expr)));
     }
 
     auto lower_bound =
-        compute::greater_equal(field_expr, compute::literal(std::move(col_min)));
-    auto upper_bound =
-        compute::less_equal(std::move(field_expr), compute::literal(std::move(col_max)));
+        compute::greater_equal(field_expr, compute::literal(std::move(min)));
+    auto upper_bound = compute::less_equal(field_expr, compute::literal(std::move(max)));
+
+    if (statistics->null_count() == 0) {

Review comment:
       Was this meant to be `!= 0`?

##########
File path: cpp/src/arrow/compute/exec/expression_test.cc
##########
@@ -64,6 +64,12 @@ Expression cast(Expression argument, std::shared_ptr<DataType> to_type) {
               compute::CastOptions::Safe(std::move(to_type)));
 }
 
+Expression invert(Expression argument) { return call("invert", {std::move(argument)}); }

Review comment:
       Isn't this `compute::not_`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org