You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/28 23:47:11 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1246: Fix NullArrayReader (#1245)

tustvold opened a new pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246


   # Which issue does this PR close?
   
   Closes #1245.
   
   # Rationale for this change
    
   Originally reported by @bjchambers, a sanity check added in #1054 fires when reading NullArrays. 
   
   This highlighted two problems:
   
   * `NullArrayReader` has always been somewhat broken as it would never flush the accumulated null buffer, instead just growing it indefinitely
   * There were literally no tests for NullArrayReader :scream: 
   
   # What changes are included in this PR?
   
   Fixes the panic, and adds a basic test of the `NullArrayReader`
   
   # Are there any user-facing changes?
   
   NullArrayReader should work again


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb merged pull request #1246: Fix NullArrayReader (#1245)

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1246: Fix NullArrayReader (#1245)

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246#discussion_r794959399



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -214,6 +214,10 @@ where
         // save definition and repetition buffers
         self.def_levels_buffer = self.record_reader.consume_def_levels()?;
         self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;
+
+        // Must consume bitmap buffer
+        self.record_reader.consume_bitmap_buffer()?;

Review comment:
       Why this is even getting computed at all is definitely a valid question, but at the same time a workload where the speed of reading arrays of nulls is the bottleneck feels decidedly... niche :laughing: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] bjchambers commented on pull request #1246: Fix NullArrayReader (#1245)

Posted by GitBox <gi...@apache.org>.
bjchambers commented on pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246#issuecomment-1024929727


   No need to rush. Next week should be fine. I can just wait to upgrade until
   this is out. Thanks for the speedy fix!
   
   On Sat, Jan 29, 2022 at 3:19 AM Andrew Lamb ***@***.***>
   wrote:
   
   > ***@***.**** approved this pull request.
   >
   > Thanks @tustvold <https://github.com/tustvold> -- I'll get this out in
   > arrow 9.0.0 next week; I can try to release a patchset sooner if you need
   > @bjchambers <https://github.com/bjchambers>
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow-rs/pull/1246#pullrequestreview-866951085>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAAIY6HXTGYKE5TH6CONQVDUYPEKTANCNFSM5NCAZKAQ>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] bjchambers commented on a change in pull request #1246: Fix NullArrayReader (#1245)

Posted by GitBox <gi...@apache.org>.
bjchambers commented on a change in pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246#discussion_r794993876



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -214,6 +214,10 @@ where
         // save definition and repetition buffers
         self.def_levels_buffer = self.record_reader.consume_def_levels()?;
         self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;
+
+        // Must consume bitmap buffer
+        self.record_reader.consume_bitmap_buffer()?;

Review comment:
       One place it comes up is where you have multiple Parquet files representing a "table", with one or more columns being relatively sparse. If you have a file dropped every hour, then it may be that some of the files have a null column while others have a few values.
   
   It doesn't seem likely it would be the bottleneck (the other columns in the file probably would be), but that's at least how it's come up for us.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1246: Fix NullArrayReader (#1245)

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1246:
URL: https://github.com/apache/arrow-rs/pull/1246#discussion_r795025401



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -214,6 +214,10 @@ where
         // save definition and repetition buffers
         self.def_levels_buffer = self.record_reader.consume_def_levels()?;
         self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;
+
+        // Must consume bitmap buffer
+        self.record_reader.consume_bitmap_buffer()?;

Review comment:
       Yeah, I didn't mean to suggest NullArrayReader itself is niche, but rather there is probably a limited benefit to optimising it to not compute the bitmask unnecessarily :smile: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org