You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/20 15:33:04 UTC

[GitHub] [arrow-rs] jhorstmann opened a new pull request, #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

jhorstmann opened a new pull request, #1602:
URL: https://github.com/apache/arrow-rs/pull/1602

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1601.
   
   # Rationale for this change
   
   All required validation should already be done as part of the ArrayData creation. 
   
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854389543


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Good idea checking the C++ version. The docs for binary layout also mention
   
   > The offsets buffer contains length + 1 signed integers
   > ...
   > and the last slot is the length of the values array
   
   which would mean there has to be a single zero in the offsets buffer for an empty `ListArray`.
   
   If this is a requirements it would be better to validate it when creating `ArrayData`. The code in [`ArrayData::validate_each_offset`](https://github.com/apache/arrow-rs/blob/b4642ecf9cb4a3d8b5741655fdc7d06b71bf41e6/arrow/src/array/data.rs#L1045) explicitly allow this case though:
   
   ```rust
           // An empty binary-like array can have 0 offsets
           if self.len == 0 && offset_buffer.is_empty() {
               return Ok(());
           }
   ```
   
   I'm more hesitant to change this now, maybe let's wait for some more eyes on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854292866


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))
+            .build()
+            .unwrap();
+
+        // Construct an empty offset buffer
+        let value_offsets = Buffer::from_iter(std::iter::empty::<i32>());

Review Comment:
   `Buffer::from([])` might be slightly cleaner?



##########
arrow/src/array/array_list.rs:
##########
@@ -1110,8 +1128,7 @@ mod tests {
     }
 
     #[test]
-    #[should_panic(expected = "offsets do not start at zero")]
-    fn test_list_array_invalid_value_offset_start() {
+    fn test_list_array_offsets_need_not_start_at_zero() {

Review Comment:
   The spec doesn't explicitly state either way on this, however, for variable length lists (e.g. UTF-8) it states
   
   > Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.
   
   So I think this is likely correct



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854360680


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Should the offsets for an empty ListArray to be something like [0, 0]?
   
   nvm: that will be an zero-length list element.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r858002223


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Filed https://github.com/apache/arrow-rs/issues/1620 as the follow on work 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r856872978


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Seems like [arrow2 also requires at least one offset](https://github.com/jorgecarleitao/arrow2/blob/8fb3b8d3f05cdc3d51f1314cfeb9bec39196789c/src/array/specification.rs#L101)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854358105


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Hmm, I don't see C++ ListArray checks first slot in offsets, but it checks the length of offsets:
   
   https://github.com/apache/arrow/blob/c70426f73326b3852d1bd7c31d98be4743f3fcba/cpp/src/arrow/array/array_nested.cc#L111-L113
   
   Seems it requires offsets to have non-zero length?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
alamb merged PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#issuecomment-1104172322

   The CI failure looks unrelated:
   
   ```
   Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": received unexpected HTTP status: 503 Service Unavailable
     Error: Docker pull failed with exit code 1
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#issuecomment-1108407248

   I took the liberty of fixing the clippy errror in 225d86c836. I am also going to see if this change helps with https://github.com/apache/arrow-rs/issues/1545 at all
   
   Thanks @jhorstmann  and all reviewers


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854360680


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Should the offsets for an empty ListArray to be something like [0, 0]?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854360680


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   <del>Should the offsets for an empty ListArray to be something like [0, 0]?</del>
   
   nvm: that will be an zero-length list element.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#issuecomment-1104105256

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1602](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (34337cc) into [master](https://codecov.io/gh/apache/arrow-rs/commit/b4642ecf9cb4a3d8b5741655fdc7d06b71bf41e6?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b4642ec) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   > :exclamation: Current head 34337cc differs from pull request most recent head 7755e0d. Consider uploading reports for the commit 7755e0d to get more accurate results
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1602   +/-   ##
   =======================================
     Coverage   82.95%   82.95%           
   =======================================
     Files         193      193           
     Lines       55384    55394   +10     
   =======================================
   + Hits        45944    45953    +9     
   - Misses       9440     9441    +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow-rs/pull/1602/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2xpc3QucnM=) | `95.61% <100.00%> (+0.08%)` | :arrow_up: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1602/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.40% <0.00%> (-0.40%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1602/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.37% <0.00%> (-0.19%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1602/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1602/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.43% <0.00%> (+0.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [b4642ec...7755e0d](https://codecov.io/gh/apache/arrow-rs/pull/1602?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r854360680


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   <del>Should the offsets for an empty ListArray to be something like [0, 0]?</del>
   
   nvm: that will result in an zero-length list element.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1602: Don't access and validate offset buffer in ListArray::from(ArrayData)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1602:
URL: https://github.com/apache/arrow-rs/pull/1602#discussion_r856874989


##########
arrow/src/array/array_list.rs:
##########
@@ -524,6 +516,32 @@ mod tests {
         assert_eq!(list_array, another)
     }
 
+    #[test]
+    fn test_empty_list_array() {
+        // Construct an empty value array
+        let value_data = ArrayData::builder(DataType::Int32)
+            .len(0)
+            .add_buffer(Buffer::from_iter(std::iter::empty::<i32>()))

Review Comment:
   Perhaps a ticket to change this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org