You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/26 07:45:46 UTC

[GitHub] [arrow-rs] atefsawaed opened a new pull request, #1621: Fix decimals min max statistics

atefsawaed opened a new pull request, #1621:
URL: https://github.com/apache/arrow-rs/pull/1621

   # Which issue does this PR close?
   
   Closes #1532.
   
   
   # Rationale for this change
    
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Added a suitable comparator for Decimals that are built with byte arrays.
   
   # Are there any user-facing changes?
   No.
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861656516


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }

Review Comment:
   Sort of. This is a safe check for empty buffers (that can represent null values). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1113572563

   Thanks @viirya  and @atefsawaed 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861661323


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;

Review Comment:
   Yes, good catch!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861191944


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)

Review Comment:
   In case of equal length bytes arrays with different first bytes, it is enough to compare the first bytes in order to determine the ordering. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r858997894


##########
parquet/src/column/writer.rs:
##########
@@ -2153,4 +2296,49 @@ mod tests {
             panic!("metadata missing statistics");
         }
     }
+
+    /// Returns Decimals column writer.
+    fn get_test_decimals_column_writer<T: DataType>(
+        page_writer: Box<dyn PageWriter>,
+        max_def_level: i16,
+        max_rep_level: i16,
+        props: WriterPropertiesPtr,
+    ) -> ColumnWriterImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_writer = get_column_writer(descr, props, page_writer);
+        get_typed_column_writer::<T>(column_writer)
+    }
+
+    /// Returns decimals column reader.
+    fn get_test_decimals_column_reader<T: DataType>(
+        page_reader: Box<dyn PageReader>,
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnReaderImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_reader = get_column_reader(descr, page_reader);
+        get_typed_column_reader::<T>(column_reader)
+    }
+
+    /// Returns descriptor for Decimal type with primitive column.
+    fn get_test_decimals_column_descr<T: DataType>(
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnDescriptor {
+        let path = ColumnPath::from("col");
+        let tpe = SchemaType::primitive_type_builder("col", T::get_physical_type())
+            .with_length(16)
+            .with_logical_type(Some(LogicalType::DECIMAL(parquet_format::DecimalType::new(2, 3))))

Review Comment:
   ```suggestion
               .with_logical_type(Some(LogicalType::Decimal {
                   scale: 2,
                   precision: 3,
               }))
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860665565


##########
parquet/src/column/writer.rs:
##########
@@ -2153,4 +2346,94 @@ mod tests {
             panic!("metadata missing statistics");
         }
     }
+
+    /// Returns Decimals column writer.
+    fn get_test_decimals_column_writer<T: DataType>(
+        page_writer: Box<dyn PageWriter>,
+        max_def_level: i16,
+        max_rep_level: i16,
+        props: WriterPropertiesPtr,
+    ) -> ColumnWriterImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_writer = get_column_writer(descr, props, page_writer);
+        get_typed_column_writer::<T>(column_writer)
+    }
+
+    /// Returns decimals column reader.
+    fn get_test_decimals_column_reader<T: DataType>(
+        page_reader: Box<dyn PageReader>,
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnReaderImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_reader = get_column_reader(descr, page_reader);
+        get_typed_column_reader::<T>(column_reader)
+    }
+
+    /// Returns descriptor for Decimal type with primitive column.
+    fn get_test_decimals_column_descr<T: DataType>(
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnDescriptor {
+        let path = ColumnPath::from("col");
+        let tpe = SchemaType::primitive_type_builder("col", T::get_physical_type())
+            .with_length(16)
+            .with_logical_type(Some(LogicalType::Decimal {
+                scale: 2,
+                precision: 3,
+            }))
+            .with_scale(2)
+            .with_precision(3)
+            .build()
+            .unwrap();
+        ColumnDescriptor::new(Arc::new(tpe), max_def_level, max_rep_level, path)
+    }
+
+    /// Returns column writer for UINT32 Column (Provided as ConvertedType without specifying the LogicalType).
+    fn get_test_unsigned_int_given_as_converted_column_writer<T: DataType>(

Review Comment:
   Sorry, I missed it. I am using it now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861204945


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }

Review Comment:
   Hmm, I don't see corresponding code for this in Java's implementation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861667292


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal = if a_length > b_length {
+            let lead_length = a_length - b_length;
+            Some((&a[0..lead_length]).iter().any(|&x| x != extension))
+        } else {
+            let lead_length = b_length - a_length;
+            Some((&b[0..lead_length]).iter().any(|&x| x != extension))
+        };
+
+        if not_equal.unwrap() {
+            let negative_values: bool = (first_a as i8) < 0;
+            let a_longer: bool = a_length > b_length;
+            return if negative_values { !a_longer } else { a_longer };
+        }
+    }
+
+    (a[1..]) > (b[1..])

Review Comment:
   No, in this case the lexicographical unsigned comparison of both buffers is sufficient.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861194276


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal: Option<bool> = if a_length > b_length {

Review Comment:
   Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860076451


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   Is it possible to add test for this too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r859977870


##########
parquet/src/column/writer.rs:
##########
@@ -998,6 +998,57 @@ impl<T: DataType> ColumnWriterImpl<T> {
         }
     }
 
+    fn compare_greater_byte_array_decimals(&self, a: &[u8], b: &[u8]) -> bool {

Review Comment:
   Can you please advise on how to add a test for a private function in this module? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1113571874

   Since the MIRI test takes a while to run (and it will run again on the branch prior to release) I am going to merge this PR in even before it is complete so I can make the release candidate. 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r859979054


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   No. This is to capture Integers specified using their converted type only (No logical type provided). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861192398


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {

Review Comment:
   Do you want me to add this as comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r859001474


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   Is this for decimal fix too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861211500


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal = if a_length > b_length {
+            let lead_length = a_length - b_length;
+            Some((&a[0..lead_length]).iter().any(|&x| x != extension))

Review Comment:
   nit: you can just do
   ```
               (&a[0..lead_length]).iter().any(|&x| x != extension)
   ```
   so `not_equal` is not an option.



##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;

Review Comment:
   nit: can this just be `return a_length > 0;`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860071596


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   Oh, I saw you have listed it in the issue description. nvm
   
   Maybe you can mention it in the PR title too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861071161


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1006,41 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };
+
+        if let Some(LogicalType::Decimal { .. }) = self.descr.logical_type() {
+            match self.descr.physical_type() {

Review Comment:
   Ah you are right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861204945


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }

Review Comment:
   Hmm, I don't see corresponding code for this in Java's implementation? Is this for null values?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860066367


##########
parquet/src/column/writer.rs:
##########
@@ -998,6 +998,57 @@ impl<T: DataType> ColumnWriterImpl<T> {
         }
     }
 
+    fn compare_greater_byte_array_decimals(&self, a: &[u8], b: &[u8]) -> bool {

Review Comment:
   I don't see it needs to access `self`, so maybe you can move it out of `ColumnWriterImpl`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861077721


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)

Review Comment:
   Hmm why we need this second condition?



##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal: Option<bool> = if a_length > b_length {

Review Comment:
   why this is a `Option`? seems it is always `Some`.



##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal: Option<bool> = if a_length > b_length {
+            let lead_length = a_length - b_length;
+            Some((&a[0..lead_length]).iter().any(|&x| x != extension))
+        } else {
+            let lead_length = b_length - a_length;
+            Some((&b[0..lead_length]).iter().any(|&x| x != extension))
+        };
+
+        if not_equal.is_some() && not_equal.unwrap() {
+            let negative_values: bool = (first_a as i8) < 0;
+            let a_longer: bool = a_length > b_length;
+            return negative_values != a_longer;

Review Comment:
   personally I feel it's easier to read via:
   ```rust
   return if negative_values { !a_longer } else { a_longer };
   ```



##########
parquet/src/column/writer.rs:
##########
@@ -2153,4 +2336,94 @@ mod tests {
             panic!("metadata missing statistics");
         }
     }
+
+    /// Returns Decimals column writer.
+    fn get_test_decimals_column_writer<T: DataType>(
+        page_writer: Box<dyn PageWriter>,
+        max_def_level: i16,
+        max_rep_level: i16,
+        props: WriterPropertiesPtr,
+    ) -> ColumnWriterImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_writer = get_column_writer(descr, props, page_writer);
+        get_typed_column_writer::<T>(column_writer)
+    }
+
+    /// Returns decimals column reader.
+    fn get_test_decimals_column_reader<T: DataType>(
+        page_reader: Box<dyn PageReader>,
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnReaderImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_reader = get_column_reader(descr, page_reader);
+        get_typed_column_reader::<T>(column_reader)
+    }
+
+    /// Returns descriptor for Decimal type with primitive column.
+    fn get_test_decimals_column_descr<T: DataType>(
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnDescriptor {
+        let path = ColumnPath::from("col");
+        let tpe = SchemaType::primitive_type_builder("col", T::get_physical_type())
+            .with_length(16)
+            .with_logical_type(Some(LogicalType::Decimal {
+                scale: 2,
+                precision: 3,
+            }))
+            .with_scale(2)
+            .with_precision(3)
+            .build()
+            .unwrap();
+        ColumnDescriptor::new(Arc::new(tpe), max_def_level, max_rep_level, path)
+    }
+
+    /// Returns column writer for UINT32 Column (Provided as ConvertedType without specifying the LogicalType).

Review Comment:
   nit: can we limit line length to 90 characters?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861204945


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }

Review Comment:
   Hmm, I don't see corresponding code for this in Java's implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861199463


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {

Review Comment:
   No, I just post it for other reviewers as a reference. But it might also be good to add into the comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1113531862

   I will merge this after CI passes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860161257


##########
parquet/src/column/writer.rs:
##########
@@ -998,6 +998,57 @@ impl<T: DataType> ColumnWriterImpl<T> {
         }
     }
 
+    fn compare_greater_byte_array_decimals(&self, a: &[u8], b: &[u8]) -> bool {

Review Comment:
   Yes, I will move it and write a test for it.



##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   Yes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] atefsawaed commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
atefsawaed commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861033402


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1006,41 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };
+
+        if let Some(LogicalType::Decimal { .. }) = self.descr.logical_type() {
+            match self.descr.physical_type() {

Review Comment:
   That's right. In this case, signed comparison of the integer values produces the correct ordering and will be done in line 1044.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860296880


##########
parquet/src/column/writer.rs:
##########
@@ -2153,4 +2346,94 @@ mod tests {
             panic!("metadata missing statistics");
         }
     }
+
+    /// Returns Decimals column writer.
+    fn get_test_decimals_column_writer<T: DataType>(
+        page_writer: Box<dyn PageWriter>,
+        max_def_level: i16,
+        max_rep_level: i16,
+        props: WriterPropertiesPtr,
+    ) -> ColumnWriterImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_writer = get_column_writer(descr, props, page_writer);
+        get_typed_column_writer::<T>(column_writer)
+    }
+
+    /// Returns decimals column reader.
+    fn get_test_decimals_column_reader<T: DataType>(
+        page_reader: Box<dyn PageReader>,
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnReaderImpl<T> {
+        let descr = Arc::new(get_test_decimals_column_descr::<T>(
+            max_def_level,
+            max_rep_level,
+        ));
+        let column_reader = get_column_reader(descr, page_reader);
+        get_typed_column_reader::<T>(column_reader)
+    }
+
+    /// Returns descriptor for Decimal type with primitive column.
+    fn get_test_decimals_column_descr<T: DataType>(
+        max_def_level: i16,
+        max_rep_level: i16,
+    ) -> ColumnDescriptor {
+        let path = ColumnPath::from("col");
+        let tpe = SchemaType::primitive_type_builder("col", T::get_physical_type())
+            .with_length(16)
+            .with_logical_type(Some(LogicalType::Decimal {
+                scale: 2,
+                precision: 3,
+            }))
+            .with_scale(2)
+            .with_precision(3)
+            .build()
+            .unwrap();
+        ColumnDescriptor::new(Arc::new(tpe), max_def_level, max_rep_level, path)
+    }
+
+    /// Returns column writer for UINT32 Column (Provided as ConvertedType without specifying the LogicalType).
+    fn get_test_unsigned_int_given_as_converted_column_writer<T: DataType>(

Review Comment:
   Hmm, where is this used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1111543495

   Hi @viirya, could you please help to trigger the CI processes? Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb merged pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
alamb merged PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861024451


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1006,41 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };
+
+        if let Some(LogicalType::Decimal { .. }) = self.descr.logical_type() {
+            match self.descr.physical_type() {

Review Comment:
   Decimals can be represented via `INT32` and `INT64` too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861114580


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {

Review Comment:
   For reference: Java Parquet's implementation: https://github.com/apache/parquet-mr/blob/06bb358bcf8a0855c54f20122a57a88d9fde16c1/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java#L213



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r860067775


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };

Review Comment:
   So this is another issue?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r861218868


##########
parquet/src/column/writer.rs:
##########
@@ -1049,6 +1084,54 @@ fn has_dictionary_support(kind: Type, props: &WriterProperties) -> bool {
     }
 }
 
+/// Signed comparison of bytes arrays
+fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
+    let a_length = a.len();
+    let b_length = b.len();
+
+    if a_length == 0 || b_length == 0 {
+        return a_length > 0 && b_length == 0;
+    }
+
+    let first_a: u8 = a[0];
+    let first_b: u8 = b[0];
+
+    // We can short circuit for different signed numbers or
+    // for equal length bytes arrays that have different first bytes.
+    // The equality requirement is necessary for sign extension cases.
+    // 0xFF10 should be equal to 0x10 (due to big endian sign extension).
+    if (0x80 & first_a) != (0x80 & first_b)
+        || (a_length == b_length && first_a != first_b)
+    {
+        return (first_a as i8) > (first_b as i8);
+    }
+
+    // When the lengths are unequal and the numbers are of the same
+    // sign we need to do comparison by sign extending the shorter
+    // value first, and once we get to equal sized arrays, lexicographical
+    // unsigned comparison of everything but the first byte is sufficient.
+
+    let extension: u8 = if (first_a as i8) < 0 { 0xFF } else { 0 };
+
+    if a_length != b_length {
+        let not_equal = if a_length > b_length {
+            let lead_length = a_length - b_length;
+            Some((&a[0..lead_length]).iter().any(|&x| x != extension))
+        } else {
+            let lead_length = b_length - a_length;
+            Some((&b[0..lead_length]).iter().any(|&x| x != extension))
+        };
+
+        if not_equal.unwrap() {
+            let negative_values: bool = (first_a as i8) < 0;
+            let a_longer: bool = a_length > b_length;
+            return if negative_values { !a_longer } else { a_longer };
+        }
+    }
+
+    (a[1..]) > (b[1..])

Review Comment:
   If the beginning of the longer buffer equals to the padding, do we need to compare leading bytes again?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r859006120


##########
parquet/src/column/writer.rs:
##########
@@ -998,6 +998,57 @@ impl<T: DataType> ColumnWriterImpl<T> {
         }
     }
 
+    fn compare_greater_byte_array_decimals(&self, a: &[u8], b: &[u8]) -> bool {

Review Comment:
   Would be nice if we can have some tests for this function.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#discussion_r858834884


##########
parquet/src/column/writer.rs:
##########
@@ -1006,6 +1057,42 @@ impl<T: DataType> ColumnWriterImpl<T> {
                 return a.as_u64().unwrap() > b.as_u64().unwrap();
             }
         }
+
+        match self.descr.converted_type() {
+            ConvertedType::UINT_8
+            | ConvertedType::UINT_16
+            | ConvertedType::UINT_32
+            | ConvertedType::UINT_64 => {
+                return a.as_u64().unwrap() > b.as_u64().unwrap();
+            }
+            _ => {}
+        };
+
+        if let Some(LogicalType::DECIMAL(_)) = self.descr.logical_type() {

Review Comment:
   A few compilation errors:
   
   ```
   error[E0599]: no variant or associated item named `DECIMAL` found for enum `basic::LogicalType` in the current scope
       --> parquet/src/column/writer.rs:1071:34
        |
   1071 |         if let Some(LogicalType::DECIMAL(_)) = self.descr.logical_type() {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1113547688

   I will hold off on making the 13.0.0 release candidate to include this fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #1621: Fix decimals min max statistics

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #1621:
URL: https://github.com/apache/arrow-rs/pull/1621#issuecomment-1113577002

   Thanks @alamb @atefsawaed @sunchao @HaoYang670 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org