You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/23 04:46:00 UTC

[GitHub] [arrow-rs] novemberkilo opened a new pull request #1477: Test and fix for bug writing a null list.

novemberkilo opened a new pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477


   # Which issue does this PR close?
   
   Closes #1036 
   
   # What changes are included in this PR?
   
   A test and a minor fix. I'm not sure if the fix is a bit naive or not.
   
   This was blocked by #1399 but now that the reader support is in for null lists, I am resurrecting this work.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1075921016


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1477](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (308e79e) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d384836296bdf4694742c03bb51abbd87589aed6?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d384836) will **increase** coverage by `0.00%`.
   > The diff coverage is `94.11%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1477   +/-   ##
   =======================================
     Coverage   82.70%   82.71%           
   =======================================
     Files         187      187           
     Lines       54208    54224   +16     
   =======================================
   + Hits        44832    44849   +17     
   + Misses       9376     9375    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.71% <66.66%> (+0.03%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.60% <100.00%> (+0.04%)` | :arrow_up: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.80% <0.00%> (+0.39%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d384836...308e79e](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r832863894



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       Sounds sensible thanks @viirya I will try adding one.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833638633



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       Ok I have a fix but whilst writing this I've found what appears to be yet another bug :disappointed: 
   
   Unfortunately it is EOD here, but I'll pick this back up first thing tomorrow (GMT) and see if I can't get a PR to fix both this bug and the one I've just discovered




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r832861148



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       Should we have a test case for list array for `filter_array_indices` ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1075921016


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1477](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (290f615) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d384836296bdf4694742c03bb51abbd87589aed6?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d384836) will **increase** coverage by `0.00%`.
   > The diff coverage is `94.11%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1477   +/-   ##
   =======================================
     Coverage   82.70%   82.70%           
   =======================================
     Files         187      187           
     Lines       54208    54224   +16     
   =======================================
   + Hits        44832    44848   +16     
     Misses       9376     9376           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.71% <66.66%> (+0.03%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.60% <100.00%> (+0.04%)` | :arrow_up: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.37% <0.00%> (-0.19%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.80% <0.00%> (+0.39%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d384836...290f615](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r832875425



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       I don't understand the machinery of levels so I don't know if I have made a nonsense test here or not. It passes ... but I don't know if that means anything yet. Does it seem sensible? // @viirya 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1075921016


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1477](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1948d4f) into [master](https://codecov.io/gh/apache/arrow-rs/commit/d384836296bdf4694742c03bb51abbd87589aed6?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d384836) will **increase** coverage by `0.00%`.
   > The diff coverage is `96.15%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1477   +/-   ##
   =======================================
     Coverage   82.70%   82.70%           
   =======================================
     Files         187      187           
     Lines       54208    54233   +25     
   =======================================
   + Hits        44832    44855   +23     
   - Misses       9376     9378    +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.88% <91.66%> (+0.21%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.60% <100.00%> (+0.04%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.35% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1477/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.80% <0.00%> (+0.39%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d384836...1948d4f](https://codecov.io/gh/apache/arrow-rs/pull/1477?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833577155



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       FWIW if you're interested in understanding what these repetition levels actually are there is a good explanation of how the parquet record shredding works [here](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet).
   
   I also attempted to explain to myself what this function is doing which may be helpful - https://github.com/apache/arrow-rs/pull/1185/files#r785357233




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833585589



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       This function only makes sense for writing leaf arrays, and so I worry that whilst this may fix the case of a List of nulls, it will behave erratically if given a list of non-nulls. I'll see if I can work out why this method is getting called at all




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r832860836



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
-        if !matches!(self.level_type, LevelType::Primitive(_)) {
+        if !matches!(
+            self.level_type,
+            LevelType::Primitive(_) | LevelType::List(_)
+        ) {
             panic!(
                 "Cannot filter indices on a non-primitive array, found {:?}",

Review comment:
       ```suggestion
                   "Cannot filter indices on a non-primitive array and non-list array, found {:?}",
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1077422508


   Thanks folks for engaging with this. Looks like the attention and discussion is flushing out the correct fix. I am relatively new to this project and while this is excellent at helping me learn, I wonder if this issue is best left to @tustvold and @viirya to tackle correctly. Happy to close this PR in favour of their work. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1077512488


   Ok I've filed https://github.com/apache/arrow-rs/pull/1481 which I think should fix the underlying bugs at play here. Thanks again @novemberkilo for reporting and drawing attention to this issue :+1:
   
   FWIW if you're interested in helping flesh out the rather limited support for nested types I would be more than happy to assist with filing tickets, reviews, etc... There is definitely a large amount of improvement that could be made in this area, what we support, what we test, etc... and I would be more than happy to assist where possible, just let me know :smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833585589



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       This function only makes sense for writing leaf arrays, and so I worry that whilst this may fix the case of a List of nulls, it will behave erratically if given a list of non-nulls. I'll see if I can work out why this method is getting called at all
   
   Edit: Well this sure looks suspect - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833638633



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -757,7 +757,10 @@ impl LevelInfo {
 
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {

Review comment:
       Ok I have a fix but whilst writing this I've found what appears to be yet another reader side bug :disappointed: 
   
   Unfortunately it is EOD here, but I'll pick this back up first thing tomorrow (GMT) and see if I can't get a PR to fix both this bug and the one I've just discovered




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo closed pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo closed pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1075912823


   Possibly of interest @viirya @sunchao @tustvold @alamb ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r832858324



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1264,13 +1264,45 @@ mod tests {
         let values = Arc::new(Int32Array::from(vec![None; SMALL_SIZE]));
         one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));
     }
+
     #[test]
     fn null_single_column() {
         let values = Arc::new(NullArray::new(SMALL_SIZE));
         one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));
         // null arrays are always nullable, a test with non-nullable nulls fails
     }
 
+    #[test]
+    fn null_list_single_column() {
+        let null_field = Field::new("item", DataType::Null, true);
+        let list_field =
+            Field::new("emptylist", DataType::List(Box::new(null_field)), true);
+
+        let schema = Schema::new(vec![list_field]);
+
+        // Build a ListArray[NullArray(0)]
+        let a_values = NullArray::new(SMALL_SIZE);
+        let a_value_offsets = arrow::buffer::Buffer::from(&[0, 0].to_byte_slice());
+        let a_list_data = ArrayData::builder(DataType::List(Box::new(Field::new(
+            "item",
+            DataType::Null,
+            true,
+        ))))
+        .len(1)
+        .add_buffer(a_value_offsets)
+        .null_bit_buffer(Buffer::from(vec![0b00011011]))
+        .add_child_data(a_values.data().clone())
+        .build()
+        .unwrap();
+
+        let a = ListArray::from(a_list_data);
+        // let values = Arc::new(a);
+        // one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));

Review comment:
       Do we need to remove these lines?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833943974



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       Okay. I tried to study the repetition/definition level algorithm a little bit. Seems this issue is not at `filter_array_indices`, although the error is bumped out there.
   
   Seems we should not have `List` level info there for this case.
   
   It comes from `https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L203`, we reuse the level info for the list. Looks suspect as @tustvold mentioned.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833489885



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       I'm not familiar with definition/repetition level computation. I will take a look at the algorithm.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on a change in pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#discussion_r833530557



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -1822,4 +1821,22 @@ mod tests {
         // filter_array_indices should return the same indices in this case.
         assert_eq!(level1.filter_array_indices(), level2.filter_array_indices());
     }
+
+    #[test]
+    fn test_filter_indices_for_lists() {

Review comment:
       That's my fault on the complexity here. Let's see if I can interpret the `LevelInfo`.
   
   A nullable list will have 3 levels if its child array is also nullable.
   
   So with the def levels:
   * 0: null list value `null`
   * 1: empty list `[]`
   * 2: list with null `[null]`
   * 3: list with value `[1]`
   
   Here's a comment earlier in this file (https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/levels.rs#L402-L407). In this case, `l1 = 0`, `l2 = 1`, `l3 = 2`. With `l3`, the actual emptiness of the value (`null` or `1` above) gets determined by the child.
   
   If the child is a non-null array, then `def = 3` isn't needed as it'd always be true, so we'd only have `def.max() == 2`. With the same pattern, if the list is non-null, then we only need 2 levels.
   
   * 0: empty list
   * 1: list with value
   
   The repetition levels let us know how nested the list is, so 0 is the start of a new list item, and 1 is a continuation.
   
   If we have repetition like: `[0, 1, 2, 1, 0, 2]` plus its definitions, we can both construct the offsets (meaning that `LevelInfo` is inefficient as we don't really need its `array_offsets`) as follows:
   
   Offsets = every value that's 0 is a new record, but if there's a 0, we have to check if the definition has an empty value for that record.
   
   So we start with 2 list values:
   
   ```
   [0, 1, 2, 1]
   [0, 2]
   ```
   
   Then we expand further to:
   
   ```
   [ [0], [1, 2], [1] ] -> 3 nested values in the list slot
   [ [0, 2] ] -> 1 nested value in the list slot
   ```
   
   Using the offsets on your test (with reps shown):
   
   ```
   1.  [0, 1, 2] (rep: [0, 1, 1])
   2. []            (rep: [0])
   3. [3, 4, 5] (rep: [0, 1, 1])
   ```
   
   So in this case `filter_array_indices` is giving us the indices that have values, hence the [0, 1, 2, 3, 4, 5] result.
   _______
   
   So maybe a good approach for null list values is to add definition = 2. I'd expect whichever index that's got a def of 2 or less to be excluded. 
   
   I'll be able to have a look at this PR tomorrow morning GMT, so I can tweak it or answer questions in detail then if you don't mind @novemberkilo 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] novemberkilo commented on pull request #1477: Test and fix for bug writing a null list.

Posted by GitBox <gi...@apache.org>.
novemberkilo commented on pull request #1477:
URL: https://github.com/apache/arrow-rs/pull/1477#issuecomment-1078457400


   Closing in favour of #1481 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org