You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/24 10:57:25 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

tustvold opened a new pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481


   # Which issue does this PR close?
   
   Closes #1480 
   Closes #1036 
   Closes #1399
   
   # Rationale for this change
   
   See tickets
   
   # What changes are included in this PR?
   
   Fixes various bugs in handling of null arrays
   
   # Are there any user-facing changes?
   
   Previous incorrect behaviour is now fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#issuecomment-1079207904


   Thanks all@


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834503880



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -887,7 +886,7 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
-        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
+        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))),

Review comment:
       That was caused by https://github.com/apache/arrow-rs/pull/1448#discussion_r831229793 if I'm not mistaken




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834502380



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -887,7 +886,7 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
-        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
+        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))),

Review comment:
       Hmm, I remember by deducting the null indices, the resulting record batch will be empty, but actually there are a row with empty list in previous issue.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834504180



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -200,9 +200,8 @@ impl LevelInfo {
                 );
 
                 match child_array.data_type() {
-                    // TODO: The behaviour of a <list<null>> is untested
-                    DataType::Null => vec![list_level],
-                    DataType::Boolean
+                    DataType::Null

Review comment:
       yea, i think this is correct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb closed pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834179677



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1564,6 +1601,25 @@ mod tests {
         one_column_roundtrip(values, true, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn list_nested_nulls() {

Review comment:
       This test was just something I wrote whilst exploring the problem space, and I think is generally useful

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -887,7 +886,7 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
-        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
+        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))),

Review comment:
       The original version of #1448 was this, which is correct, it was then changed in a subsequent follow up [commit](https://github.com/apache/arrow-rs/pull/1448/commits/acd8d47ddda45598b21b68cbd6c16849ae79e42e#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fL888). I'm not sure why...
   
   FYI @viirya 

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -677,8 +676,9 @@ impl LevelInfo {
         len: usize,
     ) -> (Vec<i64>, Vec<bool>) {
         match array.data_type() {
-            DataType::Null
-            | DataType::Boolean
+            // A NullArray is entirely nulls, despite not containing a null buffer
+            DataType::Null => ((0..=(len as i64)).collect(), vec![false; len]),

Review comment:
       This is the fix for #1480 

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -200,9 +200,8 @@ impl LevelInfo {
                 );
 
                 match child_array.data_type() {
-                    // TODO: The behaviour of a <list<null>> is untested
-                    DataType::Null => vec![list_level],
-                    DataType::Boolean
+                    DataType::Null

Review comment:
       This is the fix for #1036 

##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -225,11 +225,10 @@ where
 
     /// Reads at most `batch_size` records into array.
     fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
-        let records_read =
-            read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
+        read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
 
         // convert to arrays
-        let array = arrow::array::NullArray::new(records_read);
+        let array = arrow::array::NullArray::new(self.record_reader.num_values());

Review comment:
       This was a bug, the NullArray should have the value count number of elements, i.e. the same number of elements as the definition/repetition levels if any. 
   
   In the non-nested case the two quantities will be the same, but in the event of a nested array a single record might correspond to 1 or more values.
   
   This would then cause this line to fire - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader.rs#L930

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1513,6 +1513,43 @@ mod tests {
         required_and_optional::<LargeStringArray, _>(raw_strs);
     }
 
+    #[test]
+    fn null_list_single_column() {

Review comment:
       This is an adapted version of the test added by @novemberkilo in https://github.com/apache/arrow-rs/pull/1477/files#diff-92fdd8b1e0a40e7072901fe74855da759d2d97f02c4f7224f4217468d545ecacR1276




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#issuecomment-1077515106


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1481](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a1e37df) into [master](https://codecov.io/gh/apache/arrow-rs/commit/e778c104280cfe1b8c22615f894a95ec37564302?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e778c10) will **decrease** coverage by `0.01%`.
   > The diff coverage is `92.30%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1481      +/-   ##
   ==========================================
   - Coverage   82.72%   82.71%   -0.02%     
   ==========================================
     Files         187      187              
     Lines       54186    54239      +53     
   ==========================================
   + Hits        44824    44862      +38     
   - Misses       9362     9377      +15     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `84.80% <25.00%> (+0.12%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `78.79% <100.00%> (+0.06%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.65% <100.00%> (+0.09%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.46%)` | :arrow_down: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `82.80% <0.00%> (-0.16%)` | :arrow_down: |
   | [integration-testing/src/lib.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aW50ZWdyYXRpb24tdGVzdGluZy9zcmMvbGliLnJz) | `0.00% <0.00%> (ø)` | |
   | [arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==) | `92.62% <0.00%> (ø)` | |
   | [arrow/src/array/array\_dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2RpY3Rpb25hcnkucnM=) | `91.91% <0.00%> (+0.24%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1481/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.10% <0.00%> (+0.30%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [e778c10...a1e37df](https://codecov.io/gh/apache/arrow-rs/pull/1481?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb edited a comment on pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#issuecomment-1079207904


   Thanks all!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] viirya commented on a change in pull request #1481: Fix reading/writing nested null arrays (#1480) (#1036) (#1399)

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #1481:
URL: https://github.com/apache/arrow-rs/pull/1481#discussion_r834506221



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -887,7 +886,7 @@ fn remove_indices(
                 Ok(Arc::new(StructArray::from((new_columns, valid.finish()))))
             }
         }
-        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len()))),
+        ArrowType::Null => Ok(Arc::new(NullArray::new(arr.len() - indices.len()))),

Review comment:
       Maybe it is affected by https://github.com/apache/arrow-rs/pull/1481/files#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fR231?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org