You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/22 12:30:01 UTC

[GitHub] [arrow] milesgranger opened a new pull request, #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

milesgranger opened a new pull request, #14697:
URL: https://github.com/apache/arrow/pull/14697

   Will fix [ARROW-18265](https://issues.apache.org/jira/browse/ARROW-18265)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1038168326


##########
cpp/src/arrow/type_test.cc:
##########
@@ -434,6 +434,51 @@ TEST(TestFieldRef, DotPathRoundTrip) {
   check_roundtrip(FieldRef("foo", 1, FieldRef("bar", 2, 3), FieldRef()));
 }
 
+TEST(TestFieldRef, NestedWithList) {
+  // bit of indirection to satisfy list / large_list variants
+  auto gen_type = [](std::shared_ptr<DataType> (*f)(const std::shared_ptr<DataType>&)) {
+    return f;
+  };
+
+  for (auto list_type : {gen_type(list), gen_type(large_list)}) {
+    // Single list type
+    auto type = struct_(
+        {field("c", list_type(struct_({field("d", int32()), field("e", int8())})))});
+
+    // Note numeric values here go unused, outside of indicating a further step, during
+    // FindAll, only used by kernels for which list element to take. When FindAll
+    // encounters a list, it assumes the list value type. Not b/c they are numeric, but
+    // they are the position referencing a specific list element.
+    EXPECT_THAT(FieldRef("c", 0, "d").FindAll(*type), ElementsAre(FieldPath{0, 0, 0}));
+    EXPECT_THAT(FieldRef("c", 0, "e").FindAll(*type), ElementsAre(FieldPath{0, 0, 1}));
+    EXPECT_THAT(FieldRef("c", 1, "d").FindAll(*type), ElementsAre(FieldPath{0, 1, 0}));
+    EXPECT_THAT(FieldRef("c", 1, "e").FindAll(*type), ElementsAre(FieldPath{0, 1, 1}));

Review Comment:
   For completeness, can you also add a variant where the type starts with a list?



##########
cpp/src/arrow/type_test.cc:
##########
@@ -434,6 +434,51 @@ TEST(TestFieldRef, DotPathRoundTrip) {
   check_roundtrip(FieldRef("foo", 1, FieldRef("bar", 2, 3), FieldRef()));
 }
 
+TEST(TestFieldRef, NestedWithList) {
+  // bit of indirection to satisfy list / large_list variants
+  auto gen_type = [](std::shared_ptr<DataType> (*f)(const std::shared_ptr<DataType>&)) {
+    return f;
+  };
+
+  for (auto list_type : {gen_type(list), gen_type(large_list)}) {
+    // Single list type
+    auto type = struct_(
+        {field("c", list_type(struct_({field("d", int32()), field("e", int8())})))});
+
+    // Note numeric values here go unused, outside of indicating a further step, during
+    // FindAll, only used by kernels for which list element to take. When FindAll
+    // encounters a list, it assumes the list value type. Not b/c they are numeric, but
+    // they are the position referencing a specific list element.
+    EXPECT_THAT(FieldRef("c", 0, "d").FindAll(*type), ElementsAre(FieldPath{0, 0, 0}));
+    EXPECT_THAT(FieldRef("c", 0, "e").FindAll(*type), ElementsAre(FieldPath{0, 0, 1}));
+    EXPECT_THAT(FieldRef("c", 1, "d").FindAll(*type), ElementsAre(FieldPath{0, 1, 0}));
+    EXPECT_THAT(FieldRef("c", 1, "e").FindAll(*type), ElementsAre(FieldPath{0, 1, 1}));

Review Comment:
   Maybe also cover "failure" cases like `FieldRef("c", "non-integer", "d").FindAll(*type)`  where there is not an integer index at the location of the list level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1348661803

   Hmm... The convenience argument is convincing, I admit.
   
   Still, I wonder if we shouldn't make it optional in FieldRef, perhaps by adding a `bool allow_list_lookup = false` argument to FieldRef lookup APIs?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029403498


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Example when giving a wrong path, the error makes it more apparent than what I'm trying to explain maybe:
   ```python
   >>> arr = pa.array([{'a': [{'b': 1}, {'c': 'hi'}]}])
   >>> pc.struct_field(arr, '.a[0].d')
   No match for FieldRef.Nested(FieldRef.Name(a) FieldRef.FieldPath(0) FieldRef.Name(d)) in struct<a: list<item: struct<b: int64, c: string>>>
   ```



##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Example when giving a wrong path, the error makes it more apparent than what I'm trying to explain maybe:
   ```python
   >>> arr = pa.array([{'a': [{'b': 1}, {'c': 'hi'}]}])
   >>> pc.struct_field(arr, '.a[0].d')
   ...
   No match for FieldRef.Nested(FieldRef.Name(a) FieldRef.FieldPath(0) FieldRef.Name(d)) in struct<a: list<item: struct<b: int64, c: string>>>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1041161242


##########
cpp/src/arrow/compute/kernels/scalar_nested.cc:
##########
@@ -460,17 +469,21 @@ struct StructFieldFunctor {
   static Status CheckIndex(int index, const DataType& type) {
     if (!ValidParentType(type)) {
       return Status::TypeError("struct_field: cannot subscript field of type ", type);
-    } else if (index < 0 || index >= type.num_fields()) {
+    } else if (!IsBaseListType(type) && (index < 0 || index >= type.num_fields())) {

Review Comment:
   In theory, we could check the index for fixed sized list arrays as well (not for variable size list arrays, though, so not sure that is worth it)



##########
cpp/src/arrow/type.cc:
##########
@@ -1060,19 +1060,40 @@ struct FieldPathGetImpl {
   }
 
   template <typename T, typename GetChildren>
-  static Result<T> Get(const FieldPath* path, const std::vector<T>* children,
-                       GetChildren&& get_children, int* out_of_range_depth) {
+  static Result<T> Get(const FieldPath* path, const DataType* parent,
+                       const std::vector<T>* children, GetChildren&& get_children,
+                       int* out_of_range_depth) {
     if (path->indices().empty()) {
       return Status::Invalid("empty indices cannot be traversed");
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        auto IsBaseListType = [](const DataType& type) {
+          return dynamic_cast<const BaseListType*>(&type) != nullptr;
+        };
+
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && IsBaseListType(*(*out)->type())) {
+          children = get_children(*out);

Review Comment:
   I don't fully understand why we need the `get_children` here (because then we call `get_children` twice in a single iteration? since there is another call a bit below) 
   Can you add a clarifying comment about that?



##########
cpp/src/arrow/type_test.cc:
##########
@@ -434,6 +434,64 @@ TEST(TestFieldRef, DotPathRoundTrip) {
   check_roundtrip(FieldRef("foo", 1, FieldRef("bar", 2, 3), FieldRef()));
 }
 
+TEST(TestFieldRef, NestedWithList) {
+  // bit of indirection to satisfy list / large_list variants
+  auto gen_type = [](std::shared_ptr<DataType> (*f)(const std::shared_ptr<DataType>&)) {
+    return f;
+  };
+
+  for (auto list_type : {gen_type(list), gen_type(large_list)}) {
+    // Single list type
+    auto type = struct_(
+        {field("c", list_type(struct_({field("d", int32()), field("e", int8())})))});
+
+    // Note numeric values here go unused, outside of indicating a further step, during
+    // FindAll, only used by kernels for which list element to take. When FindAll
+    // encounters a list, it assumes the list value type. Not b/c they are numeric, but
+    // they are the position referencing a specific list element.
+    EXPECT_THAT(FieldRef("c", 0, "d").FindAll(*type), ElementsAre(FieldPath{0, 0, 0}));

Review Comment:
   Can you also add variants of those tests that pass a schema instead of type to `FindAll`?



##########
cpp/src/arrow/compute/kernels/scalar_nested.cc:
##########
@@ -460,17 +469,21 @@ struct StructFieldFunctor {
   static Status CheckIndex(int index, const DataType& type) {
     if (!ValidParentType(type)) {
       return Status::TypeError("struct_field: cannot subscript field of type ", type);
-    } else if (index < 0 || index >= type.num_fields()) {
+    } else if (!IsBaseListType(type) && (index < 0 || index >= type.num_fields())) {
       return Status::Invalid("struct_field: out-of-bounds field reference to field ",
                              index, " in type ", type, " with ", type.num_fields(),
                              " fields");
     }
     return Status::OK();
   }
 
+  static bool IsBaseListType(const DataType& type) {
+    return dynamic_cast<const BaseListType*>(&type) != nullptr;
+  }

Review Comment:
   Also, if we keep this, using a combination of a few `type.id() == ..`  checks (like `ValidParentType` just below) might be more readable to know which types are included)



##########
cpp/src/arrow/compute/kernels/scalar_nested.cc:
##########
@@ -460,17 +469,21 @@ struct StructFieldFunctor {
   static Status CheckIndex(int index, const DataType& type) {
     if (!ValidParentType(type)) {
       return Status::TypeError("struct_field: cannot subscript field of type ", type);
-    } else if (index < 0 || index >= type.num_fields()) {
+    } else if (!IsBaseListType(type) && (index < 0 || index >= type.num_fields())) {
       return Status::Invalid("struct_field: out-of-bounds field reference to field ",
                              index, " in type ", type, " with ", type.num_fields(),
                              " fields");
     }
     return Status::OK();
   }
 
+  static bool IsBaseListType(const DataType& type) {
+    return dynamic_cast<const BaseListType*>(&type) != nullptr;
+  }

Review Comment:
   We have `is_list_like` helper defined (although that also includes Map type, not sure if the list_element kernel supports that)



##########
cpp/src/arrow/type.cc:
##########
@@ -1060,19 +1060,40 @@ struct FieldPathGetImpl {
   }
 
   template <typename T, typename GetChildren>
-  static Result<T> Get(const FieldPath* path, const std::vector<T>* children,
-                       GetChildren&& get_children, int* out_of_range_depth) {
+  static Result<T> Get(const FieldPath* path, const DataType* parent,
+                       const std::vector<T>* children, GetChildren&& get_children,
+                       int* out_of_range_depth) {
     if (path->indices().empty()) {
       return Status::Invalid("empty indices cannot be traversed");
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        auto IsBaseListType = [](const DataType& type) {
+          return dynamic_cast<const BaseListType*>(&type) != nullptr;
+        };
+
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && IsBaseListType(*(*out)->type())) {
+          children = get_children(*out);

Review Comment:
   (or to put it differently: why isn't this initial get_children needed when parent is a struct type? What's the difference between both?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029385711


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   But at this point we always have FieldPaths, no? (regardless whether they originally came from string names or not?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1346633393

   I find it a bit uneasy how this PR is conflating a purely metadata-based field lookup (the current semantics of `FieldRef`) and a dynamic index lookup into a list. Is this actually important to support?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1323604784

   https://issues.apache.org/jira/browse/ARROW-18265


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1348537126

   And to be clear, that's also basically the question that I asked on the JIRA: "do we want to use FieldRef/FieldPath for this?" (because it does deviate from how it is currently implemented) It could also be a different (new to add) data structure (that would be used throughout the compute code instead of the current FieldRef/FieldPath)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1038286154


##########
cpp/src/arrow/type_test.cc:
##########
@@ -434,6 +434,51 @@ TEST(TestFieldRef, DotPathRoundTrip) {
   check_roundtrip(FieldRef("foo", 1, FieldRef("bar", 2, 3), FieldRef()));
 }
 
+TEST(TestFieldRef, NestedWithList) {
+  // bit of indirection to satisfy list / large_list variants
+  auto gen_type = [](std::shared_ptr<DataType> (*f)(const std::shared_ptr<DataType>&)) {
+    return f;
+  };
+
+  for (auto list_type : {gen_type(list), gen_type(large_list)}) {
+    // Single list type
+    auto type = struct_(
+        {field("c", list_type(struct_({field("d", int32()), field("e", int8())})))});
+
+    // Note numeric values here go unused, outside of indicating a further step, during
+    // FindAll, only used by kernels for which list element to take. When FindAll
+    // encounters a list, it assumes the list value type. Not b/c they are numeric, but
+    // they are the position referencing a specific list element.
+    EXPECT_THAT(FieldRef("c", 0, "d").FindAll(*type), ElementsAre(FieldPath{0, 0, 0}));
+    EXPECT_THAT(FieldRef("c", 0, "e").FindAll(*type), ElementsAre(FieldPath{0, 0, 1}));
+    EXPECT_THAT(FieldRef("c", 1, "d").FindAll(*type), ElementsAre(FieldPath{0, 1, 0}));
+    EXPECT_THAT(FieldRef("c", 1, "e").FindAll(*type), ElementsAre(FieldPath{0, 1, 1}));

Review Comment:
   Good call adding the variant for leading list type, it did fail when index > 0 as the parent wasn't being passed in that scenario. Works in https://github.com/apache/arrow/pull/14697/commits/c2fd9eee2fb0d7e536441394d2fd7912844ee61c but will come back later to dress it up a bit more. :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029361216


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Can you show a concrete example of the case you are thinking of?
   
   (also adding some tests for FieldPath::Get new capabilities (in C++) will probably be helpful to undestand the behaviour)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029394880


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   The first 'a', in our example, will be matched [here in the second overloaded member function](https://github.com/apache/arrow/blob/7ae4705c629655eb37b4200229d30f3a2af9e154/cpp/src/arrow/type.cc#L1407) then the second (`[1]`) goes to the first overloaded function taking `FieldPath`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029371327


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Sure, this path happens with the `".a[1].c"` case, where the 'a' goes through string name, and the second '[1]' is through this FieldPath, but is also out of bounds from children (index 1) whereas `".a[0].c"` works b/c 0 is the index we want anyway from the list type. 
   
   I'll work on adding the FieldPath::Get tests in C++. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1490645934

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement
URL: https://github.com/apache/arrow/pull/14697


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14697:
URL: https://github.com/apache/arrow/pull/14697#issuecomment-1348529294

   > a purely metadata-based field lookup (the current semantics of `FieldRef`)
   
   I am not sure that this is still entirely correct nowadays, since we already do use `FieldRef` in other places in the compute engine to refer to fields (eg to specify the sort key, in projections (although there wrapped in an expression), ...). 
   And it is certainly true that right now we only support struct/union field lookups, but also those are not purely metadata-based, as the actual lookup of values (not just the type/field from a schema) also needs to potentially calculate validity bitmaps. While for list types, the lookup of the type is also metadata-only. It's only when looking up actual values that you need a nontrivial computation.
   
   > Is this actually important to support?
   
   If you want to work with list types, this is a pretty basic operation. Now, the actual operation is already supported (`list_element` kernel), here it is only about being able to express this in a field reference. So this is somewhat in the bucket of "user convenience", as you can already achieve something similar with the kernel. 
   Now, for example also Substrait has this concept (https://github.com/substrait-io/substrait/blob/7f272f13f22cd5f5842baea42bcf7961e6251881/proto/substrait/algebra.proto#L932-L938)
   
   Another example of database that supports this in queries: https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays#accessing_array_elements


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029371327


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Sure, this path happens with the `".a[1].c"` case, where the 'a' goes through string name, and the second '[1]' is through this FieldPath, but is also out of bounds from children (index 1) whereas `".a[0].c"` works b/c 0 is the index we want anyway from the list type in order to get the list value type. 
   
   I'll work on adding the FieldPath::Get tests in C++. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029394880


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   The first 'a', in our example, will be matched [here in the second overloaded member function](https://github.com/apache/arrow/blob/7ae4705c629655eb37b4200229d30f3a2af9e154/cpp/src/arrow/type.cc#L1407) then the second (`[1]`) goes to the first overloaded function taking `FieldPath`.
   
   Edit: possible I'm grossly misunderstanding something though. :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] milesgranger commented on a diff in pull request #14697: ARROW-18265: [C++][Python] Support FieldRef to work with ListElement

Posted by GitBox <gi...@apache.org>.
milesgranger commented on code in PR #14697:
URL: https://github.com/apache/arrow/pull/14697#discussion_r1029403498


##########
cpp/src/arrow/type.cc:
##########
@@ -1066,12 +1066,32 @@ struct FieldPathGetImpl {
     }
 
     int depth = 0;
-    const T* out;
+    const T* out = nullptr;
     for (int index : path->indices()) {
       if (children == nullptr) {
         return Status::NotImplemented("Get child data of non-struct array");
       }
 
+      if constexpr (std::is_same_v<T, std::shared_ptr<arrow::Field>>) {
+        // For lists, we don't care about the index, jump right into the list value type.
+        // The index here is used in the kernel to grab the specific list element.
+        // Children then are fields from list type, and out will be the children from
+        // that field, thus index 0.
+        if (out != nullptr && (*out)->type()->id() == Type::LIST) {
+          children = get_children(*out);
+          index = 0;
+        } else if (out == nullptr && children->size() == 1 && index >= 0) {
+          // WIP: Perhaps a dangerous assumption?
+          // For list types when previous was not a FieldPath, but
+          // a string name referencing a list type. Then the first iteration
+          // with index (referencing the list item), might look like it's
+          // out of bounds when actually we don't care about this index for
+          // type checking, only the kernel does; we need the list value type
+          // which is now the only child in the children vector.

Review Comment:
   Example when giving a wrong path, the error makes it more apparent than what I'm trying to explain maybe:
   ```python
   >>> arr = pa.array([{'a': [{'b': 1}, {'c': 'hi'}]}])
   >>> pc.struct_field(arr, '.a[0].d')
   ...
   No match for FieldRef.Nested(FieldRef.Name(a) FieldRef.FieldPath(0) FieldRef.Name(d)) in struct<a: list<item: struct<b: int64, c: string>>>
   ```
   
   As opposed to all indices:
   ```python
   >>> pc.struct_field(arr, [0, 0, 2)
   ...
   No match for FieldRef.FieldPath(0 0 2) in struct<a: list<item: struct<b: int64, c: string>>>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org