You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 18:50:06 UTC

[GitHub] [arrow-rs] helgikrs opened a new pull request #1185: fix: Fix a bug in how filter indices are calculated

helgikrs opened a new pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185


   # Which issue does this PR close?
   
   Closes #1184.
   
   # What changes are included in this PR?
   
   Using the definition level and the nullability of the column only produces the
   correct indices if `max_definition - 1` is the list level.  For deeper nesting
   (struct in a list) this produces incorrect indices, silently causing incorrect
   data to be written.
   
   This PR fixes the bug by using the array offsets to compute the indices
   instead as well as a regression test.
   
   # Are there any user-facing changes?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] helgikrs commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
helgikrs commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785360768



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {
+            if len == 0 {
+                // Skip this definition level (it should be less than max_definition)

Review comment:
       Added an assert :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#issuecomment-1013736732


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1185](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8714ad5) into [master](https://codecov.io/gh/apache/arrow-rs/commit/66b84f37e18e4ba5f9b3463f84f4bbbefb6c4341?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (66b84f3) will **increase** coverage by `0.00%`.
   > The diff coverage is `95.45%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1185/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1185   +/-   ##
   =======================================
     Coverage   82.66%   82.67%           
   =======================================
     Files         173      173           
     Lines       50902    50916   +14     
   =======================================
   + Hits        42077    42093   +16     
   + Misses       8825     8823    -2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.56% <95.45%> (+0.28%)` | :arrow_up: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.43% <0.00%> (-0.14%)` | :arrow_down: |
   | [arrow/src/buffer/immutable.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2J1ZmZlci9pbW11dGFibGUucnM=) | `98.92% <0.00%> (ø)` | |
   | [arrow/src/datatypes/native.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9uYXRpdmUucnM=) | `66.66% <0.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.71% <0.00%> (+0.19%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.43% <0.00%> (+0.45%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [66b84f3...8714ad5](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785352427



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -759,13 +759,6 @@ impl LevelInfo {
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
         // happy path if not dealing with lists
-        let is_nullable = match self.level_type {
-            LevelType::Primitive(is_nullable) => is_nullable,
-            _ => panic!(

Review comment:
       I think this assertion is still important - this method will return nonsense if called on something that isn't a leaf

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle both cases by distinguishing list array nulls that don't take up space  :+1:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785357233



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle both cases by distinguishing list array nulls that don't take up space (in addition to those that take up more than one slot as is technically legal) :+1:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785357479



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {
+            if len == 0 {
+                // Skip this definition level (it should be less than max_definition)

Review comment:
       Perhaps this code should verify that it is indeed a null? :thinking: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785357233



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can continue to handle list array nulls that don't take up space, whilst handling StructArray nulls that do. :+1:
   
   _As an added bonus I think this fixes a related bug, where a ListArray with a non-zero length null slot would break for similar reasons_




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#issuecomment-1013736732


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1185](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (17fa865) into [master](https://codecov.io/gh/apache/arrow-rs/commit/66b84f37e18e4ba5f9b3463f84f4bbbefb6c4341?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (66b84f3) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1185/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1185   +/-   ##
   =======================================
     Coverage   82.66%   82.66%           
   =======================================
     Files         173      173           
     Lines       50902    50913   +11     
   =======================================
   + Hits        42077    42088   +11     
     Misses       8825     8825           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.76% <100.00%> (+0.47%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.43% <0.00%> (-0.14%)` | :arrow_down: |
   | [arrow/src/buffer/immutable.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2J1ZmZlci9pbW11dGFibGUucnM=) | `98.92% <0.00%> (ø)` | |
   | [arrow/src/datatypes/native.rs](https://codecov.io/gh/apache/arrow-rs/pull/1185/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9uYXRpdmUucnM=) | `66.66% <0.00%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [66b84f3...17fa865](https://codecov.io/gh/apache/arrow-rs/pull/1185?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785357233



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle both cases by distinguishing list array nulls that don't take up space (in addition to those that take up more than one index which the spec allows) :+1:

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle both cases by distinguishing list array nulls that don't take up space (in addition to those that take up more than one index, which the spec allows) :+1:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] helgikrs commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
helgikrs commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785360678



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -759,13 +759,6 @@ impl LevelInfo {
     /// Given a level's information, calculate the offsets required to index an array correctly.
     pub(crate) fn filter_array_indices(&self) -> Vec<usize> {
         // happy path if not dealing with lists
-        let is_nullable = match self.level_type {
-            LevelType::Primitive(is_nullable) => is_nullable,
-            _ => panic!(

Review comment:
       Makes sense. Added the assertion back.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1185: fix: Fix a bug in how filter indices are calculated

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1185:
URL: https://github.com/apache/arrow-rs/pull/1185#discussion_r785357233



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -780,17 +773,25 @@ impl LevelInfo {
                 })
                 .collect();
         }
+
         let mut filtered = vec![];
-        // remove slots that are false from definition_mask
+        let mut definition_levels = self.definition.iter();
         let mut index = 0;
-        self.definition.iter().for_each(|def| {
-            if *def == self.max_definition {
-                filtered.push(index);
-            }
-            if *def >= self.max_definition - is_nullable as i16 {
-                index += 1;
+
+        for len in self.array_offsets.windows(2).map(|s| s[1] - s[0]) {

Review comment:
       I'm not hugely familiar with this code, but the below is my attempt to describe what is going on here.
   
   The parquet writer iterates through all the leaf arrays in a potentially nested arrow array, producing a list of `LevelInfo` that it then writes. Each `LevelInfo` identifies for that leaf, what repetition and definition levels to write. This method is then used to compute which non-null indices from the source primitive array to write that correspond to this information.
   
   It therefore iterates through the definition levels and where `def == self.max_definition` it records that index as one that needs to be written to parquet. 
   
   The problem with the previous logic was it assumed that only a null or value at the leaf level, i.e. `def >= max_def - is_nullable` would increase the index position in the source primitive array. 
   
   Whilst this is the case for a parent nullable ListArray, it is not for a parent nullable StructArray. Critically, **a null in a ListArray may not take up space in its child array, but a null in a StructArray does**. 
   
   By using the `array_offsets` instead of the nesting level, we can handle both cases by correctly handling list array nulls that don't take up space (in addition to those that take up more than one index, which the spec allows) :+1:




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org