You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/20 21:29:59 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #2116: Simplify null mask preservation

tustvold opened a new pull request, #2116:
URL: https://github.com/apache/arrow-rs/pull/2116

   # Which issue does this PR close?
   
   Part of #2107
   
   # Rationale for this change
    
   The original logic is very confusing as it determines whether to use a packed decoder, based on the type of DefinitionLevelBuffer passed to `DefinitionLevelBufferDecoder`. Not only is this confusing, but it creates a problem when skipping the first records in a column chunk, as the type of decoder is not known until data has been read :scream:
   
   This largely dated from a time when `GenericRecordReader` was generic over the levels in addition to values. In the end I removed this prior to merge as it was unnecessary complexity.
   
   # What changes are included in this PR?
   
   Explicitly construct the decoders in GenericRecordReader, and passes them to `GenericColumnReader::new_with_decoders`. This allows adding an additional constructor parameter to `DefinitionLevelBufferDecoder` to instruct it whether to decode packed or not.
   
   # Are there any user-facing changes?
   
   No, all these traits are crate private


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927037087


##########
parquet/src/column/reader.rs:
##########
@@ -195,7 +195,6 @@ where
     ///
     /// `values` will be contiguously populated with the non-null values. Note that if the column
     /// is not required, this may be less than either `batch_size` or the number of levels read
-    #[inline]

Review Comment:
   It would appear that this can result in sub-optimal inlining behaviour, in particular when compiling the parquet crate there is a noticeable performance degredation. Unfortunately the inlined code is so mangled that I've been unable to determine exactly what is going on, but I may revisit this at a later date



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1190803729

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2116?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2116](https://codecov.io/gh/apache/arrow-rs/pull/2116?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (fd039ec) into [master](https://codecov.io/gh/apache/arrow-rs/commit/efd3152f85f182757138dd5984c9832c3257448c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (efd3152) will **increase** coverage by `0.04%`.
   > The diff coverage is `88.34%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2116      +/-   ##
   ==========================================
   + Coverage   83.73%   83.78%   +0.04%     
   ==========================================
     Files         225      225              
     Lines       59412    59472      +60     
   ==========================================
   + Hits        49748    49827      +79     
   + Misses       9664     9645      -19     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2116?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage ฮ” | |
   |---|---|---|
   | [parquet/src/column/reader/decoder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3JlYWRlci9kZWNvZGVyLnJz) | `61.33% <60.00%> (-2.05%)` | :arrow_down: |
   | [...rquet/src/arrow/record\_reader/definition\_levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci9kZWZpbml0aW9uX2xldmVscy5ycw==) | `88.00% <83.33%> (+2.81%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/byte\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J5dGVfYXJyYXkucnM=) | `85.79% <90.90%> (-0.57%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J1aWxkZXIucnM=) | `93.15% <100.00%> (-0.36%)` | :arrow_down: |
   | [...et/src/arrow/array\_reader/byte\_array\_dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J5dGVfYXJyYXlfZGljdGlvbmFyeS5ycw==) | `86.97% <100.00%> (+2.51%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/primitive\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL3ByaW1pdGl2ZV9hcnJheS5ycw==) | `89.01% <100.00%> (+0.38%)` | :arrow_up: |
   | [parquet/src/arrow/record\_reader/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci9tb2QucnM=) | `89.60% <100.00%> (+0.43%)` | :arrow_up: |
   | [parquet/src/column/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3JlYWRlci5ycw==) | `62.87% <100.00%> (+0.87%)` | :arrow_up: |
   | [arrow/src/array/builder/boolean\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvYm9vbGVhbl9idWlsZGVyLnJz) | `86.73% <0.00%> (-0.53%)` | :arrow_down: |
   | [arrow/src/array/builder/generic\_binary\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvZ2VuZXJpY19iaW5hcnlfYnVpbGRlci5ycw==) | `83.33% <0.00%> (-0.46%)` | :arrow_down: |
   | ... and [30 more](https://codecov.io/gh/apache/arrow-rs/pull/2116/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1190941868

   > when skipping the first records in a column chunk, as the type of decoder is not known until data has been read.
   I haven't realize this ๐Ÿ˜‚


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926198540


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {

Review Comment:
   I think choose values_decoder type only according to `column_desc` it's file level not page level. 
   So i think we don't need check every time in `set_page_reader` am i right?



##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {

Review Comment:
   I think choose values_decoder type only according to `column_desc` it's file level not page level. 
   So i think we don't need check every time in `set_page_reader` am i right? ๐Ÿค”



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926198825


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   I think choose values_decoder type only according to `column_desc` it's file level not page level. 
   So i think we don't need check every time in `set_page_reader` am i right? ๐Ÿค”



##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   One column chunk has the same def_level_decoder



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926198825


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   I think choose values_decoder type only according to `column_desc` it's file level not page level. 
   So i think we don't need check every time in `set_page_reader` am i right? ๐Ÿค”



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926222221


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   oops, i am wrong. i found need set `LevelDecoderInner` for each page, means : each page def_level and rep_level may in dif encoding ? right๐Ÿ˜‚ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926222221


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   oops, i am wrong. i found need set `LevelDecoderInner` for each page, means : each page def_level and rep_level may in dif encoding ? right๐Ÿ˜‚ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927063188


##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -127,15 +124,11 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn get_def_levels(&self) -> Option<&[i16]> {
-        self.def_levels_buffer
-            .as_ref()
-            .map(|buf| buf.typed_data())
+        self.def_levels_buffer.as_ref().map(|buf| buf.typed_data())

Review Comment:
   I don't know either...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1192674175

   Benchmark runs are scheduled for baseline = 5e3facf5ec6c15f8fec14538b96b90b254e7ab90 and contender = a9fa1b496fff8d879424ad7644b9581855503b39. a9fa1b496fff8d879424ad7644b9581855503b39 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/90162bf8180f47d4884c5d36926a1e09...c8375fa7b71e450fa5f2f71773652ddf/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e81663b553474c279e4d45fbe25017f3...0007b79ac1164becb632b151b34d960d/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/43147e8b57e542d486b78a21c6f79958...3fe67e32fe9e4ba7999495d6b4e16df6/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/14a27fc907954fe289d18b7a8c669c9b...b454f0eb96744000b1597eb36841057d/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1191546760

   Perplexingly if you run the arrow_reader benchmark from the crate root, this does not represent a performance regression, but if you run it from within the parquet crate, it does... I'm not really sure what to make of this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927063011


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -76,33 +77,8 @@ where
 {
     /// Create a new [`GenericRecordReader`]
     pub fn new(desc: ColumnDescPtr) -> Self {
-        Self::new_with_options(desc, false)
-    }
-
-    /// Create a new [`GenericRecordReader`] with the ability to only generate the bitmask
-    ///
-    /// If `null_mask_only` is true only the null bitmask will be generated and
-    /// [`Self::consume_def_levels`] and [`Self::consume_rep_levels`] will always return `None`
-    ///
-    /// It is insufficient to solely check that that the max definition level is 1 as we
-    /// need there to be no nullable parent array that will required decoded definition levels
-    ///
-    /// In particular consider the case of:
-    ///
-    /// ```ignore
-    /// message nested {
-    ///   OPTIONAL Group group {
-    ///     REQUIRED INT32 leaf;
-    ///   }
-    /// }
-    /// ```
-    ///
-    /// The maximum definition level of leaf is 1, however, we still need to decode the
-    /// definition levels so that the parent group can be constructed correctly
-    ///
-    pub(crate) fn new_with_options(desc: ColumnDescPtr, null_mask_only: bool) -> Self {
         let def_levels = (desc.max_def_level() > 0)
-            .then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
+            .then(|| DefinitionLevelBuffer::new(&desc, packed_null_mask(&desc)));

Review Comment:
   Correct :+1:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1190837373

   For some reason this represents a performance regression... More investigation needed :thinking: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1191342590

    i run IT in my modified version, it fail at skip first !๐Ÿ˜ญ
   But works well on read first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1191479191

   >  So i think the type of packed decoder is known before read data.
   
   Correct, this is just a plumbing exercise to get that knowledge into the decoder at construction time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1190777792

   I intend to run benchmarks for this shortly to confirm no regression


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2116: Simplify null mask preservation

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r926199337


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -120,9 +96,25 @@ where
 
     /// Set the current page reader.
     pub fn set_page_reader(&mut self, page_reader: Box<dyn PageReader>) -> Result<()> {
-        self.column_reader = Some(GenericColumnReader::new(
+        let descr = &self.column_desc;
+        let values_decoder = CV::new(descr);
+
+        let def_level_decoder = (descr.max_def_level() != 0).then(|| {
+            DefinitionLevelBufferDecoder::new(
+                descr.max_def_level(),
+                packed_null_mask(descr),

Review Comment:
   One column chunk has the same def_level_decoder



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#issuecomment-1191894016

   CI Failure should be fixed by https://github.com/apache/arrow-rs/pull/2121


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927056687


##########
parquet/src/column/reader/decoder.rs:
##########
@@ -277,25 +288,25 @@ enum LevelDecoderInner {
 impl ColumnLevelDecoder for ColumnLevelDecoderImpl {
     type Slice = [i16];
 
-    fn new(max_level: i16, encoding: Encoding, data: ByteBufferPtr) -> Self {
-        let bit_width = num_required_bits(max_level as u64);
+    fn set_data(&mut self, encoding: Encoding, data: ByteBufferPtr) {

Review Comment:
   I am not an expert in this area, but the new code structure seems to make sense to me



##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -127,15 +124,11 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn get_def_levels(&self) -> Option<&[i16]> {
-        self.def_levels_buffer
-            .as_ref()
-            .map(|buf| buf.typed_data())
+        self.def_levels_buffer.as_ref().map(|buf| buf.typed_data())

Review Comment:
   it is not entirely clear to me why the formatting changed on these lines -- not that it is a bad change, but it seems like it wasn't a semantic change either ๐Ÿคท 



##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -76,33 +77,8 @@ where
 {
     /// Create a new [`GenericRecordReader`]
     pub fn new(desc: ColumnDescPtr) -> Self {
-        Self::new_with_options(desc, false)
-    }
-
-    /// Create a new [`GenericRecordReader`] with the ability to only generate the bitmask
-    ///
-    /// If `null_mask_only` is true only the null bitmask will be generated and
-    /// [`Self::consume_def_levels`] and [`Self::consume_rep_levels`] will always return `None`
-    ///
-    /// It is insufficient to solely check that that the max definition level is 1 as we
-    /// need there to be no nullable parent array that will required decoded definition levels
-    ///
-    /// In particular consider the case of:
-    ///
-    /// ```ignore
-    /// message nested {
-    ///   OPTIONAL Group group {
-    ///     REQUIRED INT32 leaf;
-    ///   }
-    /// }
-    /// ```
-    ///
-    /// The maximum definition level of leaf is 1, however, we still need to decode the
-    /// definition levels so that the parent group can be constructed correctly
-    ///
-    pub(crate) fn new_with_options(desc: ColumnDescPtr, null_mask_only: bool) -> Self {
         let def_levels = (desc.max_def_level() > 0)
-            .then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
+            .then(|| DefinitionLevelBuffer::new(&desc, packed_null_mask(&desc)));

Review Comment:
   Is this is the key change in this PR? that the decision to use a null mask is pushed down to this level?



##########
parquet/src/column/reader.rs:
##########
@@ -195,7 +195,6 @@ where
     ///
     /// `values` will be contiguously populated with the non-null values. Note that if the column
     /// is not required, this may be less than either `batch_size` or the number of levels read
-    #[inline]

Review Comment:
   as in "when you leave `inline` the benchmarks get slower"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2116: Simplify null mask preservation in parquet reader

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927063681


##########
parquet/src/column/reader.rs:
##########
@@ -195,7 +195,6 @@ where
     ///
     /// `values` will be contiguously populated with the non-null values. Note that if the column
     /// is not required, this may be less than either `batch_size` or the number of levels read
-    #[inline]

Review Comment:
   When you leave `inline` certain benchmarks get slower when compiled from the parquet crate, although there is no difference when compiled from the workspace level... It is incredibly strange...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org