You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/10/29 10:53:12 UTC

[PR] Parquet: read/write f16 for Arrow [arrow-rs]

Jefffrey opened a new pull request, #5003:
URL: https://github.com/apache/arrow-rs/pull/5003

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Relates to #4986
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Allow reading and writing f16 type from parquet to/from Arrow recordbatches
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1798350610

   Thank you for this, in the interests of saving time I intend to review this after https://github.com/apache/parquet-testing/pull/40 has merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1810097462

   I've raised https://github.com/apache/arrow-rs/issues/5075 and https://github.com/apache/arrow-rs/issues/5074 to follow up on the discussion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1389675899


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};

Review Comment:
   Are there tests for ensuring that `BoundaryOrder` is deduced correctly (i.e. not like a normal fixed binary)? I'd imagine that would be automatically handled by the Float16-specific comparators included here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1375413985


##########
parquet/src/file/statistics.rs:
##########
@@ -243,6 +243,8 @@ pub fn to_thrift(stats: Option<&Statistics>) -> Option<TStatistics> {
         distinct_count: stats.distinct_count().map(|value| value as i64),
         max_value: None,
         min_value: None,
+        is_max_value_exact: None,

Review Comment:
   Due to https://github.com/apache/parquet-format/commit/31f92c73ecca63b596b5edb391f9ac5eba9dbbf8
   
   Unsure if there is more work required to support this, that might require a separate issue?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1786011573

   > I think before we merge this we should get a file containing this type added to parquet-testing, so that we can ensure interoperability with other implementations. Otherwise this looks reasonable to me
   
   There does seem to be this PR for parquet-testing: https://github.com/apache/parquet-testing/pull/40


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1803925383

   @benibus Do you feel like taking a look at this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1390063433


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};

Review Comment:
   Ah, I don't think I've considered that (nor truncation of statistics). Thanks for pointing that out, I'll check what changes will need to be made for those



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1376738909


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -771,6 +771,13 @@ fn write_leaf(writer: &mut ColumnWriter<'_>, levels: &ArrayLevels) -> Result<usi
                         .unwrap();
                     get_decimal_256_array_slice(array, indices)
                 }
+                ArrowDataType::Float16 => {
+                    let array = column
+                        .as_any()
+                        .downcast_ref::<arrow_array::Float16Array>()
+                        .unwrap();

Review Comment:
   ```suggestion
                       let array = column.as_primitive::<Float16Type>();
   ```



##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -924,6 +925,53 @@ mod tests {
             .unwrap();
     }
 
+    #[test]
+    fn test_float16_roundtrip() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![Field::new(
+            "float16",
+            ArrowDataType::Float16,
+            true,
+        )]));
+
+        let mut buf = Vec::with_capacity(1024);
+        let mut writer = ArrowWriter::try_new(&mut buf, schema.clone(), None)?;
+
+        let original = RecordBatch::try_new(
+            schema,
+            vec![Arc::new(Float16Array::from_iter_values([
+                f16::EPSILON,
+                f16::INFINITY,
+                f16::MIN,
+                f16::MAX,
+                f16::NAN,
+                f16::INFINITY,
+                f16::NEG_INFINITY,
+                f16::ONE,
+                f16::NEG_ONE,
+                f16::ZERO,
+                f16::NEG_ZERO,
+                f16::E,
+                f16::PI,
+                f16::FRAC_1_PI,
+            ]))],
+        )?;
+
+        writer.write(&original)?;
+        writer.close()?;
+
+        let mut reader = ParquetRecordBatchReader::try_new(Bytes::from(buf), 1024)?;
+        let ret = reader.next().unwrap()?;
+        assert_eq!(ret, original);
+
+        // Ensure can be downcast to the correct type
+        ret.column(0)
+            .as_any()
+            .downcast_ref::<Float16Array>()
+            .unwrap();

Review Comment:
   ```suggestion
           ret.column(0).as_primitive::<Float16Type>();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1390297511


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};

Review Comment:
   Actually in regards to truncation, is it intended to disallow truncation of f16 stats or it should take place anyway?
   
   Nowhere in the new spec for f16 does it mention this case, and neither did I see relevant changes in https://github.com/apache/arrow/pull/36073, but perhaps I missed it?
   
   e.g. if user set the truncation for column index length to 1 byte, should we still truncate f16 to one byte since its underlying representation if Fixed len byte array, or should leave it as 2 bytes as truncating f16 doesn't make sense since it doesn't follow the sort order for fixed len byte arrays?
   
   @benibus 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1384372963


##########
parquet/src/column/writer/mod.rs:
##########
@@ -967,18 +969,23 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
 }
 
 fn update_min<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T, min: &mut Option<T>) {
-    update_stat::<T, _>(val, min, |cur| compare_greater(descr, cur, val))
+    update_stat::<T, _>(descr, val, min, |cur| compare_greater(descr, cur, val))
 }
 
 fn update_max<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T, max: &mut Option<T>) {
-    update_stat::<T, _>(val, max, |cur| compare_greater(descr, val, cur))
+    update_stat::<T, _>(descr, val, max, |cur| compare_greater(descr, val, cur))
 }
 
 #[inline]
 #[allow(clippy::eq_op)]
-fn is_nan<T: ParquetValueType>(val: &T) -> bool {
+fn is_nan<T: ParquetValueType>(descr: &ColumnDescriptor, val: &T) -> bool {
     match T::PHYSICAL_TYPE {
         Type::FLOAT | Type::DOUBLE => val != val,
+        Type::FIXED_LEN_BYTE_ARRAY if descr.logical_type() == Some(LogicalType::Float16) => {
+            let val = val.as_bytes();
+            let val = f16::from_le_bytes([val[0], val[1]]);
+            val.is_nan()
+        }

Review Comment:
   Purely from type `T` we can't determine if the FixedLenByteArray represents a Float16 logical type, hence need `ColumnDescriptor` param to provide this information, to subsequently check NaN for f16



##########
parquet/src/column/writer/mod.rs:
##########
@@ -1038,6 +1049,14 @@ fn compare_greater<T: ParquetValueType>(descr: &ColumnDescriptor, a: &T, b: &T)
         };
     };
 
+    if let Some(LogicalType::Float16) = descr.logical_type() {
+        let a = a.as_bytes();
+        let a = f16::from_le_bytes([a[0], a[1]]);
+        let b = b.as_bytes();
+        let b = f16::from_le_bytes([b[0], b[1]]);
+        return a > b;
+    }

Review Comment:
   Don't compare as bytes, compare as f16's



##########
parquet/src/column/writer/mod.rs:
##########
@@ -2077,6 +2097,79 @@ mod tests {
         }
     }
 
+    #[test]
+    fn test_column_writer_check_float16_min_max() {
+        let input = [
+            -f16::ONE,
+            f16::from_f32(3.0),
+            -f16::from_f32(2.0),
+            f16::from_f32(2.0),
+        ]
+        .into_iter()
+        .map(|s| ByteArray::from(s).into())
+        .collect::<Vec<_>>();
+
+        let stats = float16_statistics_roundtrip(&input);
+        assert!(stats.has_min_max_set());
+        assert!(stats.is_min_max_backwards_compatible());
+        assert_eq!(stats.min(), &ByteArray::from(-f16::from_f32(2.0)));
+        assert_eq!(stats.max(), &ByteArray::from(f16::from_f32(3.0)));
+    }
+
+    #[test]
+    fn test_column_writer_check_float16_nan_middle() {
+        let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
+            .into_iter()
+            .map(|s| ByteArray::from(s).into())
+            .collect::<Vec<_>>();
+
+        let stats = float16_statistics_roundtrip(&input);
+        assert!(stats.has_min_max_set());
+        assert!(stats.is_min_max_backwards_compatible());
+        assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
+        assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
+    }
+
+    #[test]
+    fn test_float16_statistics_nan_middle() {
+        let input = [f16::ONE, f16::NAN, f16::ONE + f16::ONE]
+            .into_iter()
+            .map(|s| ByteArray::from(s).into())
+            .collect::<Vec<_>>();
+
+        let stats = float16_statistics_roundtrip(&input);
+        assert!(stats.has_min_max_set());
+        assert!(stats.is_min_max_backwards_compatible());
+        assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
+        assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
+    }
+
+    #[test]
+    fn test_float16_statistics_nan_start() {
+        let input = [f16::NAN, f16::ONE, f16::ONE + f16::ONE]
+            .into_iter()
+            .map(|s| ByteArray::from(s).into())
+            .collect::<Vec<_>>();
+
+        let stats = float16_statistics_roundtrip(&input);
+        assert!(stats.has_min_max_set());
+        assert!(stats.is_min_max_backwards_compatible());
+        assert_eq!(stats.min(), &ByteArray::from(f16::ONE));
+        assert_eq!(stats.max(), &ByteArray::from(f16::ONE + f16::ONE));
+    }
+
+    #[test]
+    fn test_float16_statistics_nan_only() {
+        let input = [f16::NAN, f16::NAN]
+            .into_iter()
+            .map(|s| ByteArray::from(s).into())
+            .collect::<Vec<_>>();
+
+        let stats = float16_statistics_roundtrip(&input);
+        assert!(!stats.has_min_max_set());
+        assert!(stats.is_min_max_backwards_compatible());
+    }

Review Comment:
   Ensuring NaN's are handled correctly in statistics (i.e. are ignored)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1785940241

   There seem to be a lot of formatting changes in this PR, perhaps they could be split out to make it easier to see where the changes are?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1384173903


##########
parquet/src/file/statistics.rs:
##########
@@ -243,6 +243,8 @@ pub fn to_thrift(stats: Option<&Statistics>) -> Option<TStatistics> {
         distinct_count: stats.distinct_count().map(|value| value as i64),
         max_value: None,
         min_value: None,
+        is_max_value_exact: None,

Review Comment:
   Seems might be covered by/related to https://github.com/apache/arrow-rs/issues/5037



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1797385065

   Just realized I need to also fix the statistics handling as well, otherwise I think it might write incorrect stats for files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1391077317


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};

Review Comment:
   Ok, so will need to insert special case for Float16 to ensure its stats won't get truncated
   
   Also regarding the `BoundaryOrder` it seems it isn't being derived in general yet, for any other types:
   
   https://github.com/apache/arrow-rs/blob/31b5724332666d68fe94a5b3572a13a51022ea81/parquet/src/file/metadata.rs#L888-L889
   
   I couldn't find where it might be set, so this could be a separate issue to be done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#discussion_r1390937524


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1170,6 +1188,7 @@ fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
 mod tests {
     use crate::{file::properties::DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH, format::BoundaryOrder};

Review Comment:
   Probably better to ignore the truncation limit in this case, IMO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: read/write f16 for Arrow [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5003:
URL: https://github.com/apache/arrow-rs/pull/5003#issuecomment-1809281413

   I'm going to merge this as I think it moves things forward. Any follow up from https://github.com/apache/arrow-rs/pull/5003#discussion_r1389675899 I think can safely be handled in a subsequent PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org