You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/11 15:26:27 UTC

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2045: Generify parquet write path (#1764)

tustvold commented on code in PR #2045:
URL: https://github.com/apache/arrow-rs/pull/2045#discussion_r918060047


##########
parquet/src/data_type.rs:
##########
@@ -581,18 +582,21 @@ pub(crate) mod private {
     /// crate, and thus hint to the type system (and end user) traits are public for the contract
     /// and not for extension.
     pub trait ParquetValueType:
-        std::cmp::PartialEq
+        PartialEq
         + std::fmt::Debug
         + std::fmt::Display
-        + std::default::Default
-        + std::clone::Clone
+        + Default
+        + Clone
         + super::AsBytes
         + super::FromBytes
-        + super::SliceAsBytes
+        + SliceAsBytes
         + PartialOrd
         + Send
         + crate::encodings::decoding::private::GetDecoder
+        + crate::file::statistics::private::MakeStatistics
     {
+        const PHYSICAL_TYPE: Type;

Review Comment:
   This is moved onto `ParquetValueType` as it allows implementing methods in terms of `ParquetValueType` which has the advantage that type inference works 



##########
parquet/src/file/statistics.rs:
##########
@@ -37,15 +37,45 @@
 //! }
 //! ```
 
-use std::{cmp, fmt};
+use std::fmt;
 
 use byteorder::{ByteOrder, LittleEndian};
 use parquet_format::Statistics as TStatistics;
 
 use crate::basic::Type;
+use crate::data_type::private::ParquetValueType;
 use crate::data_type::*;
 use crate::util::bit_util::from_ne_slice;
 
+pub(crate) mod private {
+    use super::*;
+
+    pub trait MakeStatistics {

Review Comment:
   I'm not a huge fan of this, but it was the only way I could work out to do this, and it is crate-local so if we come up with a better way we can change it in the future



##########
parquet/src/column/writer/mod.rs:
##########
@@ -305,42 +267,38 @@ impl<'a, T: DataType> ColumnWriterImpl<'a, T> {
         // TODO: find out why we don't account for size of levels when we estimate page
         // size.
 
-        // Find out the minimal length to prevent index out of bound errors.
-        let mut min_len = values.len();
-        if let Some(levels) = def_levels {
-            min_len = min_len.min(levels.len());
-        }
-        if let Some(levels) = rep_levels {
-            min_len = min_len.min(levels.len());
-        }
+        let num_levels = match def_levels {
+            Some(def_levels) => def_levels.len(),
+            None => values.len(),
+        };
 
         // Find out number of batches to process.
         let write_batch_size = self.props.write_batch_size();
-        let num_batches = min_len / write_batch_size;
+        let num_batches = num_levels / write_batch_size;

Review Comment:
   I think this was a bug before, previously the presence of nulls (which would lead to less values), would result in skewed batching



##########
parquet/src/basic.rs:
##########
@@ -212,7 +212,7 @@ pub enum Repetition {
 /// Encodings supported by Parquet.
 /// Not all encodings are valid for all types. These enums are also used to specify the
 /// encoding of definition and repetition levels.
-#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
+#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Ord, PartialOrd)]

Review Comment:
   This is needed to allow for stable encoding ordering



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org