Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 20:08:54 UTC

[GitHub] [arrow-rs] alamb commented on a diff in pull request #2116: Simplify null mask preservation in parquet reader

alamb commented on code in PR #2116:
URL: https://github.com/apache/arrow-rs/pull/2116#discussion_r927056687


##########
parquet/src/column/reader/decoder.rs:
##########
@@ -277,25 +288,25 @@ enum LevelDecoderInner {
 impl ColumnLevelDecoder for ColumnLevelDecoderImpl {
     type Slice = [i16];
 
-    fn new(max_level: i16, encoding: Encoding, data: ByteBufferPtr) -> Self {
-        let bit_width = num_required_bits(max_level as u64);
+    fn set_data(&mut self, encoding: Encoding, data: ByteBufferPtr) {

Review Comment:
   I am not an expert in this area, but the new code structure seems to make sense to me.



##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -127,15 +124,11 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn get_def_levels(&self) -> Option<&[i16]> {
-        self.def_levels_buffer
-            .as_ref()
-            .map(|buf| buf.typed_data())
+        self.def_levels_buffer.as_ref().map(|buf| buf.typed_data())

Review Comment:
   It is not entirely clear to me why the formatting changed on these lines. It is not a bad change, but it does not seem to be a semantic change either 🤷



##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -76,33 +77,8 @@ where
 {
     /// Create a new [`GenericRecordReader`]
     pub fn new(desc: ColumnDescPtr) -> Self {
-        Self::new_with_options(desc, false)
-    }
-
-    /// Create a new [`GenericRecordReader`] with the ability to only generate the bitmask
-    ///
-    /// If `null_mask_only` is true only the null bitmask will be generated and
-    /// [`Self::consume_def_levels`] and [`Self::consume_rep_levels`] will always return `None`
-    ///
-    /// It is insufficient to solely check that that the max definition level is 1 as we
-    /// need there to be no nullable parent array that will required decoded definition levels
-    ///
-    /// In particular consider the case of:
-    ///
-    /// ```ignore
-    /// message nested {
-    ///   OPTIONAL Group group {
-    ///     REQUIRED INT32 leaf;
-    ///   }
-    /// }
-    /// ```
-    ///
-    /// The maximum definition level of leaf is 1, however, we still need to decode the
-    /// definition levels so that the parent group can be constructed correctly
-    ///
-    pub(crate) fn new_with_options(desc: ColumnDescPtr, null_mask_only: bool) -> Self {
         let def_levels = (desc.max_def_level() > 0)
-            .then(|| DefinitionLevelBuffer::new(&desc, null_mask_only));
+            .then(|| DefinitionLevelBuffer::new(&desc, packed_null_mask(&desc)));

Review Comment:
   Is this the key change in this PR: that the decision to use a null mask is pushed down to this level?
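
   To make the question concrete, here is a minimal sketch of the kind of check I would assume `packed_null_mask` performs, based on the doc comment removed above. `ColumnInfo`, `packed_null_mask_sketch`, and the two constructor helpers are hypothetical stand-ins for illustration; they are not the `parquet` crate's types, and the real helper in this PR may differ.

```rust
/// Hypothetical stand-in for the parts of a Parquet column descriptor that
/// matter here. This is NOT the `parquet` crate's `ColumnDescriptor`.
pub struct ColumnInfo {
    /// Maximum definition level of the leaf column.
    pub max_def_level: i16,
    /// Maximum repetition level of the leaf column.
    pub max_rep_level: i16,
    /// Whether the leaf field itself is declared OPTIONAL.
    pub leaf_is_optional: bool,
}

/// Assumed shape of the check: a packed null bitmask is sufficient only when
/// the single definition level belongs to the leaf itself, not to a nullable
/// parent group, and the column is not repeated.
pub fn packed_null_mask_sketch(col: &ColumnInfo) -> bool {
    col.max_def_level == 1 && col.max_rep_level == 0 && col.leaf_is_optional
}

/// Models a top-level `OPTIONAL INT32 leaf`.
pub fn flat_optional_leaf() -> ColumnInfo {
    ColumnInfo {
        max_def_level: 1,
        max_rep_level: 0,
        leaf_is_optional: true,
    }
}

/// Models `OPTIONAL group { REQUIRED INT32 leaf }` from the removed doc comment.
pub fn required_leaf_in_optional_group() -> ColumnInfo {
    ColumnInfo {
        max_def_level: 1,
        max_rep_level: 0,
        leaf_is_optional: false,
    }
}
```

   Under these assumptions, both columns have a maximum definition level of 1, but only the flat optional leaf could use the packed mask. The nested case still needs decoded definition levels so that the parent group can be constructed correctly.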



##########
parquet/src/column/reader.rs:
##########
@@ -195,7 +195,6 @@ where
     ///
     /// `values` will be contiguously populated with the non-null values. Note that if the column
     /// is not required, this may be less than either `batch_size` or the number of levels read
-    #[inline]

Review Comment:
   As in, "when you leave `inline` in place, the benchmarks get slower"?


