You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/08 15:13:05 UTC

[GitHub] [arrow-rs] nevi-me opened a new pull request #270: Fix null struct and list roundtrip

nevi-me opened a new pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270


   # Which issue does this PR close?
   
   Closes #245 .
   
    # Rationale for this change
   
   This addresses bugs in the Rust Parquet writer and reader, where we were:
   * Not reading lists correctly when they have null bitmaps vs not. We were creating null bitmaps even when a list was non-null.
   * Not incrementing definitions correctly for some combinations of lists and structs
   
   # What changes are included in this PR?
   
   This PR:
   
   * Fixes the reader, by making roundtrip tests pass under conditions that were previously failing (mostly when lists are set as non-nullable).
   * Combines a few loose variables into a `LevelType` enum, that has enough information about the Arrow types when calculating levels. This is a lighter solution that passing Arrow fields around when computing levels, and could allow us to reuse the levels logic elsewhere in the codebase.
   * Enables nullability conditions that were failing
   
   In working on this PR, I:
   * Wrote the rest recordbatch to an IPC file
   * Wrote the test recordbatch to parquet
   * Read the file with `pyarrow`, and wrote it to disk with `pyarrow.parquet`
   * Read both parquet files with `pyarrow.parquet`, and confirmed that the results were identical
   * Read both parquet files with `pyspark`, and confirmed that the results were identical
   
   An interesting observation is that `pyspark` always interpreted the parquet columns as all nullable.
   
   # Are there any user-facing changes?
   
   All changes are within crate-level structs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-835399207


   @alamb @jorgecarleitao this is a big one for me, as I've been able to verify that the writer does what it's supposed to do with mixed nested types.
   I think the `LevelInfo` logic is now solid, and I should be able to start optimising it, like removing `Vec<bool>` and replacing it with bitmaps where it makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-838247734


   Great job @nevi-me 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me merged pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
nevi-me merged pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#discussion_r629328631



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -315,80 +314,40 @@ impl LevelInfo {
         // we use 64-bit offsets to also accommodate large arrays
         array_offsets: Vec<i64>,
         array_mask: Vec<bool>,
-        is_list: bool,
-        is_parent_struct: bool,
-        is_nullable: bool,
+        level_type: LevelType,
     ) -> Self {
         let min_len = *(array_offsets.last().unwrap()) as usize;
         let mut definition = Vec::with_capacity(min_len);
         let mut repetition = Vec::with_capacity(min_len);
         let mut merged_array_mask = Vec::with_capacity(min_len);
 
-        // determine the total level increment based on data types
-        let max_definition = match is_list {
-            false => {
-                // first exception, start of a batch, and not list
-                if self.max_definition == 0 {
-                    1
-                } else if self.is_list {
-                    // second exception, always increment after a list
-                    self.max_definition + 1
-                } else if is_parent_struct && !self.is_nullable {
-                    // if the parent is a non-null struct, don't increment
-                    self.max_definition
-                } else {
-                    self.max_definition + is_nullable as i16
-                }
+        let max_definition = match (self.level_type, level_type) {
+            (LevelType::Root, LevelType::Struct(is_nullable)) => {
+                // If the struct is non-nullable, its def level doesn't increment
+                is_nullable as i16
             }
-            true => {
-                if is_parent_struct && !self.is_nullable {
-                    self.max_definition + is_nullable as i16
-                } else {
-                    self.max_definition + 1 + is_nullable as i16
-                }
+            (LevelType::Root, _) => 1,
+            (_, LevelType::Root) => {
+                unreachable!("Cannot have a root as a child")
+            }
+            (LevelType::List(_), _) => {
+                // Always add 1  (TDDO: document why)

Review comment:
       I wonder if this TODO is still valid




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-837861950


   I've opened #282 to track the remaining issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-836762589


   @jorgecarleitao @alamb there's still errors with the more complex cases like `<struct<list<list<struct<list<primitive>>>>>>`, where one of the items is 0 length. I'm going to work on that case this week.
   
   I discovered the issue by chance when trying to benchmark the writer with deeply nested lists and structs. `pyarrow` manages to write this correctly (wrote an IPC file from Rust, then read it in pyarrow and wrote that to parquet)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on a change in pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#discussion_r629401310



##########
File path: parquet/src/arrow/levels.rs
##########
@@ -315,80 +314,40 @@ impl LevelInfo {
         // we use 64-bit offsets to also accommodate large arrays
         array_offsets: Vec<i64>,
         array_mask: Vec<bool>,
-        is_list: bool,
-        is_parent_struct: bool,
-        is_nullable: bool,
+        level_type: LevelType,
     ) -> Self {
         let min_len = *(array_offsets.last().unwrap()) as usize;
         let mut definition = Vec::with_capacity(min_len);
         let mut repetition = Vec::with_capacity(min_len);
         let mut merged_array_mask = Vec::with_capacity(min_len);
 
-        // determine the total level increment based on data types
-        let max_definition = match is_list {
-            false => {
-                // first exception, start of a batch, and not list
-                if self.max_definition == 0 {
-                    1
-                } else if self.is_list {
-                    // second exception, always increment after a list
-                    self.max_definition + 1
-                } else if is_parent_struct && !self.is_nullable {
-                    // if the parent is a non-null struct, don't increment
-                    self.max_definition
-                } else {
-                    self.max_definition + is_nullable as i16
-                }
+        let max_definition = match (self.level_type, level_type) {
+            (LevelType::Root, LevelType::Struct(is_nullable)) => {
+                // If the struct is non-nullable, its def level doesn't increment
+                is_nullable as i16
             }
-            true => {
-                if is_parent_struct && !self.is_nullable {
-                    self.max_definition + is_nullable as i16
-                } else {
-                    self.max_definition + 1 + is_nullable as i16
-                }
+            (LevelType::Root, _) => 1,
+            (_, LevelType::Root) => {
+                unreachable!("Cannot have a root as a child")
+            }
+            (LevelType::List(_), _) => {
+                // Always add 1  (TDDO: document why)

Review comment:
       Missed that, I'll fix that in a few hours




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-835781821


   Hi @nevi-me  -- thank you. I will probably will not have a chance to review this fully until tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #270: Fix null struct and list roundtrip

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #270:
URL: https://github.com/apache/arrow-rs/pull/270#issuecomment-835401693


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#270](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f779e23) into [master](https://codecov.io/gh/apache/arrow-rs/commit/8bd769bc118f617580cfaa7b1e654159c0fb696b?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (8bd769b) will **increase** coverage by `0.02%`.
   > The diff coverage is `87.67%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/270/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #270      +/-   ##
   ==========================================
   + Coverage   82.53%   82.56%   +0.02%     
   ==========================================
     Files         162      162              
     Lines       43796    43822      +26     
   ==========================================
   + Hits        36149    36180      +31     
   + Misses       7647     7642       -5     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/270/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `77.12% <80.95%> (-0.02%)` | :arrow_down: |
   | [parquet/src/arrow/levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/270/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvbGV2ZWxzLnJz) | `82.59% <87.15%> (+1.33%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/270/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `98.45% <100.00%> (+0.02%)` | :arrow_up: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/270/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [8bd769b...f779e23](https://codecov.io/gh/apache/arrow-rs/pull/270?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org