You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Light-City (via GitHub)" <gi...@apache.org> on 2023/04/13 06:15:27 UTC

[GitHub] [arrow] Light-City opened a new pull request, #35098: GH-35097: [C++] ArrayData support for child_data slice.

Light-City opened a new pull request, #35098:
URL: https://github.com/apache/arrow/pull/35098

   ### Rationale for this change
   
   issue: [35097](https://github.com/apache/arrow/issues/35097)
   
   ### What changes are included in this PR?
   
   change data.cc, for ArrayData::Slice, add child_data logic.
   change  array_struct_test.cc, add ChildSlice test.
   
   ### Are these changes tested?
   
   yes! see array_struct_test.cc.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] danepitkin commented on pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "danepitkin (via GitHub)" <gi...@apache.org>.
danepitkin commented on PR #35098:
URL: https://github.com/apache/arrow/pull/35098#issuecomment-1579015789

   Hey @Light-City , would you consider moving this PR to draft mode while awaiting changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Light-City commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "Light-City (via GitHub)" <gi...@apache.org>.
Light-City commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1168122295


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   > > Can we provide arraydata with slice operations on child_data?
   > 
   > What is the use case? Both `ArrayData` and `Buffer` do not provide slice operations so it would be non-trivial to do this.
   
   case: I have an two phase avg.
   
   phase1: returns an intermediate result struct array<sum, count>.
   
   phase2: input: struct array<sum, count>, output: float8, in Consume, I imitate SumImpl to implement addition to struct<sum, count>,split it to sum_arr and count_arr.
   
   just like SumImpl,  `this->sum += SumArray<CType, SumCType, SimdLevel>(*data);` data is from :`const auto& data = batch[0].array();`. it is an arraydata not array.
   
   so for our phase 2 consum, input data is a struct array data, code like this:
   ```cpp
   auto arraydata = *(batch[0].array());
   ConsumeSumArray(arraydata.child_data[0]);
   ConsumeCountArray(arraydata.child_data[1]);
   ```
   
   But in the grouping scenario, I may have my batch sliced, so arraydata should also be sliced, so I have to do this here
   
   ```
   auto array_data = batch[0].array();
   auto struct_array = std::make_shared<StructArray>(array_data);
   ConsumeSumArray(struct_array->field(0)->data());
   ConsumeCountArray(struct_array->field(1)->data());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Light-City commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "Light-City (via GitHub)" <gi...@apache.org>.
Light-City commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1177176973


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   > 
   
   good idea1



##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   > 
   
   good idea!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35098:
URL: https://github.com/apache/arrow/pull/35098#issuecomment-1506407105

   * Closes: #35097


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1168125919


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   In this case, you should explicitly pass around the parent offset or adjust the child offsets. 
   
   It may be reasonable to provide a method on Array that can pre-adjust the children for you (without actually copying the data).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1167360957


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   > Can we provide arraydata with slice operations on child_data?
   
   What is the use case? Both `ArrayData` and `Buffer` do not provide slice operations so it would be non-trivial to do this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Light-City commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "Light-City (via GitHub)" <gi...@apache.org>.
Light-City commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1168503823


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   OK, thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Light-City commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "Light-City (via GitHub)" <gi...@apache.org>.
Light-City commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1166506233


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   Can we provide arraydata with slice operations on child_data?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #35098:
URL: https://github.com/apache/arrow/pull/35098#issuecomment-1579036512

   > Hey @Light-City , would you consider moving this PR to draft mode while awaiting changes?
   
   I think the PR can be closed (see https://github.com/apache/arrow/pull/35098#discussion_r1170157451 for a justification) and a new one could be opened if the author is interested in implementing array materialization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35098:
URL: https://github.com/apache/arrow/pull/35098#issuecomment-1506407138

   :warning: GitHub issue #35097 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1170157451


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   @Light-City *avoiding materialization* is a common theme of query execution. Databases are often working on data that is larger than memory or taking all the memory with the dataset itself. Allocating more memory to produce intermediate results is a no-no. And this tradition is honored by the Arrow design. That's why it can be seen as a bit counterintuitive for people used to array functions of common programming languages.
   
   For instance, in JavaScript, `Array.prototype.slice(begin, end)` creates a new array. In database-speak this is worded as "arr.slice() materializes the slice". An alternative implementation could return an object with a reference to the original array and the bounds of the slice to avoid memory allocation. That would require that every function you normally use to work with arrays be aware of the slice boundaries. It would complicate JavaScript too much, but for Arrow that is exactly the intended design. Every compute kernel in Arrow has to be aware of the offset and length to minimize materialization, thus minimizing memory consumption.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] lidavidm commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "lidavidm (via GitHub)" <gi...@apache.org>.
lidavidm commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1166390730


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   Yes, the expectation *in the C++ implementation* is that slicing an array is a metadata-only operation that just adjusts the "offset" of an array, and does not affect child arrays. That means code directly manipulating child arrays must (recursively) account for parent array offsets.
   
   However, it may be good to document how to 'materialize' or slice the child array(s) to account for this. I'm not aware of a convenient way to do this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] Light-City commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "Light-City (via GitHub)" <gi...@apache.org>.
Light-City commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1168122864


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   The essence is that the agg function operates on arraydata. If I need a slice, I must call it explicitly or construct an array before using it. Is it reasonable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #35098: GH-35097: [C++] ArrayData support for child_data slice.

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #35098:
URL: https://github.com/apache/arrow/pull/35098#discussion_r1166248633


##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   I think this breaks the contract of a sliced array. Please check the docstring of `ArrayData`
   ```
     // The logical start point into the physical buffers (in values, not bytes).
     // Note that, for child data, this must be *added* to the child data's own offset.
     int64_t offset = 0;
   ```



##########
cpp/src/arrow/array/data.cc:
##########
@@ -144,6 +144,8 @@ std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
   } else {
     copy->null_count = null_count != 0 ? kUnknownNullCount : 0;
   }
+  for (auto& child : copy->child_data) 
+    child = child->Slice(copy->offset, copy->length);

Review Comment:
   cc @lidavidm @westonpace 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org