You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/22 23:14:20 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

wjones127 opened a new pull request, #34697:
URL: https://github.com/apache/arrow/pull/34697

   ### Rationale for this change
   
   See #34696
   
   ### What changes are included in this PR?
   
   Adds another check. Also removes DCHECKs from `SetData()`, since they are redundant with the `Validate()` checks and I don't think we do that for other arrays. Having them present made it harder to get to the root cause of this error, because `SetData()` is called on the array when validating a record batch.
   
   ### Are these changes tested?
   
   * [ ] Add tests
   
   ### Are there any user-facing changes?
   
   **This PR contains a "Critical Fix".** Without this, sending a REE array with an invalid null buffer through IPC to the C++ implementation will cause it to crash if the array is validated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34697:
URL: https://github.com/apache/arrow/pull/34697#discussion_r1147630352


##########
cpp/src/arrow/array/array_run_end.cc:
##########
@@ -74,19 +74,6 @@ void RunEndEncodedArray::SetData(const std::shared_ptr<ArrayData>& data) {
   ARROW_CHECK_EQ(ree_type->run_end_type()->id(), data->child_data[0]->type->id());
   ARROW_CHECK_EQ(ree_type->value_type()->id(), data->child_data[1]->type->id());
 
-  DCHECK_EQ(data->child_data.size(), 2);
-
-  // A non-zero number of logical values in this array (offset + length) implies
-  // a non-zero number of runs and values.
-  DCHECK(data->offset + data->length == 0 || data->child_data[0]->length > 0);
-  DCHECK(data->offset + data->length == 0 || data->child_data[1]->length > 0);
-  // At least as many values as run_ends
-  DCHECK_GE(data->child_data[1]->length, data->child_data[0]->length);
-
-  // The null count for run-end encoded arrays is always 0. Actual number of
-  // nulls needs to be calculated through other means.
-  DCHECK_EQ(data->null_count, 0);
-
   Array::SetData(data);
   run_ends_array_ = MakeArray(this->data()->child_data[0]);
   values_array_ = MakeArray(this->data()->child_data[1]);

Review Comment:
   This will segfault if `child_data.size() < 2`, so maybe we should wrap this in size checks to allow invalid invalid data to go through until `Validate` checks it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1480373493

   * Closes: #34696


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1480744180

   cc @felipecrv 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1487662122

   Benchmark runs are scheduled for baseline = 0ead719bdcc90bdeee4d225f7bb9dab1c458eccf and contender = 02bc24cd531ae7597dc5c93c85293870c57d8b73. 02bc24cd531ae7597dc5c93c85293870c57d8b73 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/5f1c24a0f91b4fc393ba243969780208...abc4b4ac748f45d2b7e449b923d599c4/)
   [Finished :arrow_down:0.3% :arrow_up:0.3%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/33553e5726a946e385906b16635c973a...a978bee780444c0f8a48cf298ab90073/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7607c4a79305488d881b40a08652336e...e77348bf08384fd79c85a5fe3aa6d4af/)
   [Finished :arrow_down:0.55% :arrow_up:0.15%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/92f178f54c124f5c9d96d9fdc662e12c...20ed76644dec4be3936192b3d038ee7d/)
   Buildkite builds:
   [Finished] [`02bc24cd` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2590)
   [Finished] [`02bc24cd` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2620)
   [Finished] [`02bc24cd` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2588)
   [Finished] [`02bc24cd` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2611)
   [Finished] [`0ead719b` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2589)
   [Finished] [`0ead719b` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2619)
   [Finished] [`0ead719b` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2587)
   [Finished] [`0ead719b` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2610)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1485387051

   > as RUN_END_ENCODED is a new type, we could add this layout validation in this switch without breaking existing code as REEs are not used yet. This is our chance :-)
   
   So we have added an equivalent check to this within the REE array path:
   
   https://github.com/apache/arrow/blob/4a0fc2868cbb7c2e78f76855ad476c260b92b495/cpp/src/arrow/array/validate.cc#L648-L650
   
   My questions is whether we should enable this for other arrays that have always null buffers, such as NullArray and UnionArray. It seems we don't enforce this for these other types right now in validate, and I'd rather not change that in this PR. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34697:
URL: https://github.com/apache/arrow/pull/34697#discussion_r1149460638


##########
cpp/src/arrow/array/array_run_end.cc:
##########
@@ -74,19 +74,6 @@ void RunEndEncodedArray::SetData(const std::shared_ptr<ArrayData>& data) {
   ARROW_CHECK_EQ(ree_type->run_end_type()->id(), data->child_data[0]->type->id());
   ARROW_CHECK_EQ(ree_type->value_type()->id(), data->child_data[1]->type->id());
 
-  DCHECK_EQ(data->child_data.size(), 2);
-
-  // A non-zero number of logical values in this array (offset + length) implies
-  // a non-zero number of runs and values.
-  DCHECK(data->offset + data->length == 0 || data->child_data[0]->length > 0);
-  DCHECK(data->offset + data->length == 0 || data->child_data[1]->length > 0);
-  // At least as many values as run_ends
-  DCHECK_GE(data->child_data[1]->length, data->child_data[0]->length);
-
-  // The null count for run-end encoded arrays is always 0. Actual number of
-  // nulls needs to be calculated through other means.
-  DCHECK_EQ(data->null_count, 0);
-
   Array::SetData(data);
   run_ends_array_ = MakeArray(this->data()->child_data[0]);
   values_array_ = MakeArray(this->data()->child_data[1]);

Review Comment:
   Added a check again, but also moved it up since we index into `child_data` in earlier checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34697:
URL: https://github.com/apache/arrow/pull/34697


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1480373509

   :warning: GitHub issue #34696 **has no components**, please add labels for components.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1481985644

   Alternatively, should we be raising here?
   
   https://github.com/apache/arrow/blob/5bbaf6bade2232a0d60e912defb97579a11db83f/cpp/src/arrow/array/validate.cc#L492-L494


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34697: GH-34696: [C++] Check REE arrays have no null buffer in Validate()

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34697:
URL: https://github.com/apache/arrow/pull/34697#issuecomment-1482833193

   Thank you for fixing this @wjones127. 
   
   > Alternatively, should we be raising here?
   > 
   > https://github.com/apache/arrow/blob/5bbaf6bade2232a0d60e912defb97579a11db83f/cpp/src/arrow/array/validate.cc#L492-L494
   
   @wjones127 as RUN_END_ENCODED is a new type, we could add this layout validation in this switch without breaking existing code as REEs are not used yet. This is our chance :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org