You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "benibus (via GitHub)" <gi...@apache.org> on 2023/06/14 19:39:03 UTC

[GitHub] [arrow] benibus opened a new pull request, #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

benibus opened a new pull request, #36073:
URL: https://github.com/apache/arrow/pull/36073

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue.
   
   ### What changes are included in this PR?
   
   This PR implements basic support for a `Float16LogicalType` in accordance with the proposed spec. More specifically, this includes:
   
   - Changes to `parquet.thrift` and regenerated `parqet_types` files
   - Basic `LoticalType` class definition, method impls, and enums
   - Support for specialized comparisons and column statistics
   
   In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these things before the proposal is approved.
   
   ### Are these changes tested?
   
   Yes (tests are included)
   
   ### Are there any user-facing changes?
   
   Yes
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1286318075


##########
cpp/src/arrow/util/float16.cc:
##########
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <ostream>
+
+#include "arrow/util/float16.h"
+
+namespace arrow {
+namespace util {
+
+std::ostream& operator<<(std::ostream& os, Float16Base arg) { return (os << arg.bits()); }

Review Comment:
   Yep... the lack of conversion functions is the big elephant in the room. I initially avoided going there since I didn't expect they'd be needed for this PR, but their utility has become kinda obvious at this point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1307662503


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in little-endian byte order
+  constexpr std::array<uint8_t, 2> ToLittleEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#else
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#endif
+  }
+
+  /// \brief Copy the value's bytes in big-endian byte order
+  void ToBigEndian(uint8_t* dest) const {
+    Float16{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in big-endian byte order
+  constexpr std::array<uint8_t, 2> ToBigEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#else
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#endif
+  }
+
+  constexpr Float16 operator-() const { return Float16(value_ ^ 0x8000); }
+  constexpr Float16 operator+() const { return Float16(value_); }
+
+  friend constexpr bool operator==(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16 lhs, Float16 rhs) { return !(lhs == rhs); }
+
+  friend constexpr bool operator<(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16 lhs, Float16 rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16 lhs, Float16 rhs) { return rhs <= lhs; }
+
+  ARROW_FRIEND_EXPORT friend std::ostream& operator<<(std::ostream& os, Float16 arg);
+
+ protected:
+  uint16_t value_;
+
+ private:
+  // Comparison helpers that assume neither operand is NaN
+  static constexpr bool CompareEq(Float16 lhs, Float16 rhs) {
+    return (lhs.bits() == rhs.bits()) || (lhs.is_zero() && rhs.is_zero());
+  }
+  static constexpr bool CompareLt(Float16 lhs, Float16 rhs) {
+    if (lhs.signbit()) {
+      if (rhs.signbit()) {
+        // Both are negative
+        return lhs.bits() > rhs.bits();
+      } else {
+        // Handle +/-0
+        return !lhs.is_zero() || rhs.bits() != 0;
+      }
+    } else if (rhs.signbit()) {
+      return false;
+    } else {
+      // Both are positive
+      return lhs.bits() < rhs.bits();
+    }
+  }
+};
+
+static_assert(std::is_trivial_v<Float16>);
+
+}  // namespace util
+}  // namespace arrow
+
+// TODO: Not complete

Review Comment:
   Dropping overloads and specializations into std is UB except where specifically allowed, and I don't see that noted for [isnan](https://eel.is/c++draft/cmath.syn). We could get around that a little by having `isnan.h`:
   
   ```c++
   #include <cmath>
   #include "arrow/util/float16.h"
   namespace arrow::util {
   using std::isnan;
   constexpr bool isnan(Float16 f) { return f.isnan(); }
   }
   ```
   
   ... I'm not sure if that's worthwhile though; we'd only get any benefit out of that when we need to call isnan and we happen to be in a sufficiently generic context that we don't wish to have explicitly different code for `float` and `Float16`. I'd guess it's not worth the trouble



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1234470603


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {

Review Comment:
   Not sure how generally useful it would be either. I only really added it for symmetry given that the little-endian requirement is Parquet-specific.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1230977677


##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   We might even want to define a `arrow::Float16` struct for easier handling of half-precision floats (together with a bunch of operators, and perhaps overrides for `std::isnan`, `std::numeric_limits` and friends)?
   
   For example:
   ```c++
   class Float16 {
    public:
     Float16() = default;
     explicit Float16(uint16_t value) : value_(value) {}
   
     static Float16 FromBytes(const uint8_t*) { ... }
     static Float16 FromBytes(std::array<uint8_t, 2>) { ... }
     std::array<uint8_t, 2> ToBytes() const { ... }
   
     uint16_t bits() const { return value_; }
   
     friend bool operator== ...
   
    protected:
     uint16_t value_;
   };
   
   static_assert(std::is_trivial_v<Float16>);
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1232863123


##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +278,54 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16::max_ptr()}; }
+  static T DefaultMax() { return T{float16::min_ptr()}; }

Review Comment:
   They're actually meant to be swapped like that (the impl basically treats them as null-like values). For instance, this is the equivalent for native floats: https://github.com/apache/arrow/blob/5e993f9ca402199a613d7abaea163c5cd9b3dccb/cpp/src/parquet/statistics.cc#L84-L85



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1593292940

   > Is this in line with what I've done here (i.e. the `FLBA` itself representing a little-endian binary float)
   
   Yes!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1662855681

   @benibus Should the float16 type's usage in the Parquet code be put behind a Parquet version check? i.e. should it error out if reading/writing a Parquet v1.0 file as opposed to a v2.6 or whatever?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1671928609

   @wgtmac Thanks for the update! I should mention that I actually made some progress on the Java side myself - although I didn't get very deep into the more substantial stuff (statistics, comparators). That being said, I'd be happy to let @zhangjiashen take the wheel since it'd allow me to focus on the C++/Go side (I dont _really_ know Java) - plus, I'll be away next week. If you want, I can provide my work in some form as a starting point.
   
   I'm happy to help where I can, of course - particularly with the floating point logic. Right now I'm working on conversions between short/half-precision floats for this PR, so if the need arises for similar functionality in parquet-mr then stay tuned.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1811253615

   Update: I don't think it'd be very useful to add tests for different type lengths at the reader/writer level since it isn't even possible to construct a `PrimitiveNode` with an inapplicable length without throwing an exception. I suppose that's why there aren't similar test for other (e.g. decimal, uuid) types AFAICT.
   
   However, it turns out that I forgot to add the relevant tests to `schema_test.cc`, so I went ahead and included those - along with the other requested changes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1285481082


##########
cpp/src/parquet/statistics.cc:
##########
@@ -527,9 +616,28 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void IncrementNumValues(int64_t n) override { num_values_ += n; }
 
+  static bool IsMeaningfulLogicalType(LogicalType::Type::type type) {

Review Comment:
   Emm... May I ask why only F16 is the meaningFul logical type?



##########
cpp/src/arrow/util/float16.cc:
##########
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <ostream>
+
+#include "arrow/util/float16.h"
+
+namespace arrow {
+namespace util {
+
+std::ostream& operator<<(std::ostream& os, Float16Base arg) { return (os << arg.bits()); }

Review Comment:
   Agree, maybe print float is better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1302751308


##########
cpp/src/arrow/util/float16.cc:
##########
@@ -0,0 +1,187 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <ostream>
+
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+
+namespace {
+
+// --------------------------------------------------------
+// Binary conversions
+// --------------------------------------------------------
+// These routines are partially adapted from Numpy's C implementation
+//
+// Some useful metrics for conversions between different precisions:
+// |-----------------------------------------|
+// | precision | half    | single  | double  |
+// |-----------------------------------------|
+// | mantissa  | 10 bits | 23 bits | 53 bits |

Review Comment:
   Should it be "52 bits" for double? In these three cases, a leading 1 is implied without being materialized AFAIU.



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }

Review Comment:
   For completeness, add a `is_finite` method? (meaning neither inf nor nan)



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -61,6 +63,18 @@ using schema::PrimitiveNode;
 
 namespace test {
 
+struct BufferedFloat16 {
+  explicit BufferedFloat16(Float16 f16)
+      : f16(f16), buffer(*::arrow::AllocateBuffer(sizeof(uint16_t))) {
+    this->f16.ToLittleEndian(buffer->mutable_data());
+  }
+  explicit BufferedFloat16(uint16_t bits) : BufferedFloat16(Float16(bits)) {}
+  const uint8_t* bytes() const { return buffer->data(); }
+
+  Float16 f16;
+  std::shared_ptr<::arrow::Buffer> buffer;

Review Comment:
   Perhaps this can simply be a `std::array<uint8_t, 2>`? The buffer itself doesn't seem used below, unless I'm missing something.



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1082,6 +1109,45 @@ void TestStatisticsSortOrder<FLBAType>::SetValues() {
       .set_max(std::string(reinterpret_cast<const char*>(&vals[8][0]), FLBA_LENGTH));
 }
 
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::AddNodes(std::string name) {
+  auto node =
+      schema::PrimitiveNode::Make(name, Repetition::REQUIRED, LogicalType::Float16(),
+                                  Type::FIXED_LEN_BYTE_ARRAY, sizeof(uint16_t));
+  fields_.push_back(std::move(node));
+}
+
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::SetValues() {
+  constexpr int kValueLen = 2;
+  constexpr int kNumBytes = NUM_VALUES * kValueLen;
+
+  const uint16_t u16_vals[NUM_VALUES] = {
+      0b1100010100000000,  // -5.0
+      0b1100010000000000,  // -4.0
+      0b1100001000000000,  // -3.0
+      0b1100000000000000,  // -2.0
+      0b1011110000000000,  // -1.0
+      0b0000000000000000,  // +0.0
+      0b0011110000000000,  // +1.0
+      0b0100000000000000,  // +2.0
+      0b0100001000000000,  // +3.0
+      0b0100010000000000,  // +4.0

Review Comment:
   Can you reorder the values so that they are not pre-sorted? This would exercise the minmax routines a bit better.



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};
+      if (binary_array.IsValid(i)) {
+        const uint8_t* in_ptr = binary_array.GetValue(i);
+        f16 = Float16::FromLittleEndian(in_ptr);
+      }
+      f16.ToBytes(out_ptr);
+    }
+  } else {
+#if ARROW_LITTLE_ENDIAN
+    // No need to byte-swap, so do a simple copy
+    std::memcpy(out_ptr, binary_array.raw_values(), length * byte_width);
+#else
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      const uint8_t* in_ptr = binary_array.GetValue(i);
+      Float16::FromLittleEndian(in_ptr).ToBytes(out_ptr);
+    }
+#endif
+  }
+
+  *out = std::make_shared<::arrow::HalfFloatArray>(
+      type, length, std::move(data), binary_array.null_bitmap(), null_count);
+  return Status::OK();
+}
+
+/// \brief Convert an arrow::BinaryArray to an arrow::HalfFloatArray
+/// We do this by:
+/// 1. Creating an arrow::BinaryArray from the RecordReader's builder
+/// 2. Allocating a buffer for the arrow::HalfFloatArray
+/// 3. Converting the little-endian bytes in each BinaryArray entry to native-endian
+/// halffloat (uint16_t) values
+Status TransferHalfFloat(RecordReader* reader, MemoryPool* pool,
+                         const std::shared_ptr<Field>& field, Datum* out) {
+  auto binary_reader = dynamic_cast<BinaryRecordReader*>(reader);
+  DCHECK(binary_reader);
+  ::arrow::ArrayVector chunks = binary_reader->GetBuilderChunks();

Review Comment:
   One could just reuse `TransferBinary` here, and then view the resulting `FixedSizeBinaryArray` as `float16`.



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -772,6 +845,13 @@ Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& va
       RETURN_NOT_OK(TransferBinary(reader, pool, value_field, &chunked_result));
       result = chunked_result;
     } break;
+    case ::arrow::Type::HALF_FLOAT: {
+      if (descr->physical_type() != ::parquet::Type::FIXED_LEN_BYTE_ARRAY) {
+        return Status::Invalid("Physical type for ", value_field->type()->ToString(),
+                               " must be fixed length binary");
+      }

Review Comment:
   Do you also want to check the FLBA width?



##########
cpp/src/parquet/arrow/test_util.h:
##########
@@ -65,12 +65,44 @@ struct Decimal256WithPrecisionAndScale {
   static constexpr int32_t scale = PRECISION - 1;
 };
 
+inline std::vector<uint16_t> RandomHalfFloatValues(size_t size, uint16_t min,

Review Comment:
   And perhaps `random_real` can be enhanced to accept a `Float16` type parameter?



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -916,9 +925,15 @@ TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) {
 }
 
 TYPED_TEST(TestParquetIO, SingleColumnOptionalDictionaryWrite) {
-  // Skip tests for BOOL as we don't create dictionaries for it.
-  if (TypeParam::type_id == ::arrow::Type::BOOL) {
-    return;
+  switch (TypeParam::type_id) {
+    // Skip tests for BOOL as we don't create dictionaries for it.
+    case ::arrow::Type::BOOL:
+    // Skip tests for HALF_FLOAT as it's not currently supported by `dictionary_encode`
+    case ::arrow::Type::HALF_FLOAT:
+      GTEST_SKIP();

Review Comment:
   Let's add a skip message for clarity (though I'm not sure GTest always displays it).
   ```suggestion
         GTEST_SKIP() << "dictionary_encode not supported for float16" ;
   ```



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;

Review Comment:
   Also a `ToDouble` for completeness?



##########
cpp/src/arrow/util/float16_test.cc:
##########
@@ -0,0 +1,246 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <array>
+#include <cmath>
+#include <utility>
+#include <vector>
+
+#include <gtest/gtest.h>
+
+#include "arrow/testing/gtest_util.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+namespace {
+
+template <typename T>
+using Limits = std::numeric_limits<T>;
+
+float F32(uint32_t bits) { return SafeCopy<float>(bits); }
+
+TEST(Float16Test, RoundTripFromFloat32) {
+  struct TestCase {
+    float f32;
+    uint16_t b16;
+    float f16_as_f32;
+  };
+  // Expected values were also manually validated with numpy-1.24.3
+  const TestCase test_cases[] = {
+      // +/-0.0f
+      {F32(0x80000000u), 0b1000000000000000u, -0.0f},
+      {F32(0x00000000u), 0b0000000000000000u, +0.0f},
+      // 32-bit exp is 102 => 2^-25. Rounding to nearest.
+      {F32(0xb3000001u), 0b1000000000000001u, -5.96046447754e-8f},
+      // 32-bit exp is 102 => 2^-25. Rounding to even.
+      {F32(0xb3000000u), 0b1000000000000000u, -0.0f},
+      // 32-bit exp is 101 => 2^-26. Underflow to zero.
+      {F32(0xb2800001u), 0b1000000000000000u, -0.0f},
+      // 32-bit exp is 108 => 2^-19.
+      {F32(0xb61a0000u), 0b1000000000100110u, -2.26497650146e-6f},
+      // 32-bit exp is 108 => 2^-19.
+      {F32(0xb61e0000u), 0b1000000000101000u, -2.38418579102e-6f},
+      // 32-bit exp is 112 => 2^-15. Rounding to nearest.
+      {F32(0xb87fa001u), 0b1000001111111111u, -6.09755516052e-5f},
+      // 32-bit exp is 112 => 2^-15. Rounds to 16-bit exp of 1 => 2^-14
+      {F32(0xb87fe001u), 0b1000010000000000u, -6.103515625e-5f},
+      // 32-bit exp is 142 => 2^15. Rounding to nearest.
+      {F32(0xc7001001u), 0b1111100000000001u, -32800.0f},
+      // 32-bit exp is 142 => 2^15. Rounding to even.
+      {F32(0xc7001000u), 0b1111100000000000u, -32768.0f},
+      // 65520.0f rounds to inf
+      {F32(0x477ff000u), 0b0111110000000000u, Limits<float>::infinity()},
+      // 65488.0039062f rounds to 65504.0 (float16 max)
+      {F32(0x477fd001u), 0b0111101111111111u, 65504.0f},
+      // 32-bit exp is 127 => 2^0, rounds to 16-bit exp of 16 => 2^1.
+      {F32(0xbffff000u), 0b1100000000000000u, -2.0f},
+  };
+
+  for (size_t index = 0; index < std::size(test_cases); ++index) {
+    ARROW_SCOPED_TRACE("index=", index);
+    const auto& tc = test_cases[index];
+    const auto f16 = Float16::FromFloat(tc.f32);
+    EXPECT_EQ(tc.b16, f16.bits());
+    EXPECT_EQ(tc.f16_as_f32, f16.ToFloat());

Review Comment:
   Also test predicates? For example:
   ```suggestion
       EXPECT_EQ(tc.f16_as_f32, f16.ToFloat());
       EXPECT_EQ(std::signbit(tc.f16_as_f32), f16.signbit());
       EXPECT_EQ(std::isnan(tc.f16_as_f32), f16.isnan());
       EXPECT_EQ(std::isinf(tc.f16_as_f32), f16.isinf());
   ```
   



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in little-endian byte order
+  constexpr std::array<uint8_t, 2> ToLittleEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#else
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#endif
+  }
+
+  /// \brief Copy the value's bytes in big-endian byte order
+  void ToBigEndian(uint8_t* dest) const {
+    Float16{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in big-endian byte order
+  constexpr std::array<uint8_t, 2> ToBigEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#else
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#endif
+  }
+
+  constexpr Float16 operator-() const { return Float16(value_ ^ 0x8000); }
+  constexpr Float16 operator+() const { return Float16(value_); }
+
+  friend constexpr bool operator==(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16 lhs, Float16 rhs) { return !(lhs == rhs); }
+
+  friend constexpr bool operator<(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16 lhs, Float16 rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16 lhs, Float16 rhs) { return rhs <= lhs; }
+
+  ARROW_FRIEND_EXPORT friend std::ostream& operator<<(std::ostream& os, Float16 arg);
+
+ protected:
+  uint16_t value_;
+
+ private:
+  // Comparison helpers that assume neither operand is NaN
+  static constexpr bool CompareEq(Float16 lhs, Float16 rhs) {
+    return (lhs.bits() == rhs.bits()) || (lhs.is_zero() && rhs.is_zero());
+  }
+  static constexpr bool CompareLt(Float16 lhs, Float16 rhs) {
+    if (lhs.signbit()) {
+      if (rhs.signbit()) {
+        // Both are negative
+        return lhs.bits() > rhs.bits();
+      } else {
+        // Handle +/-0
+        return !lhs.is_zero() || rhs.bits() != 0;
+      }
+    } else if (rhs.signbit()) {
+      return false;
+    } else {
+      // Both are positive
+      return lhs.bits() < rhs.bits();
+    }
+  }
+};
+
+static_assert(std::is_trivial_v<Float16>);
+
+}  // namespace util
+}  // namespace arrow
+
+// TODO: Not complete

Review Comment:
   By the way, do you think it would be admissible to specialize `std::isnan` and friends for `Float16`? @bkietz 



##########
cpp/src/arrow/util/float16.cc:
##########
@@ -0,0 +1,187 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <ostream>
+
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+
+namespace {
+
+// --------------------------------------------------------
+// Binary conversions
+// --------------------------------------------------------
+// These routines are partially adapted from Numpy's C implementation

Review Comment:
   (note for another issue/PR: interestingly Numpy has started adding HW accelerations for these: https://github.com/numpy/numpy/blob/main/numpy/core/src/common/half.hpp#L55)



##########
cpp/src/parquet/arrow/arrow_schema_test.cc:
##########
@@ -851,6 +851,8 @@ TEST_F(TestConvertArrowSchema, ArrowFields) {
        ParquetType::FIXED_LEN_BYTE_ARRAY, 7},
       {"decimal(32, 8)", ::arrow::decimal(32, 8), LogicalType::Decimal(32, 8),
        ParquetType::FIXED_LEN_BYTE_ARRAY, 14},
+      {"float16", ::arrow::float16(), LogicalType::Float16(),
+       ParquetType::FIXED_LEN_BYTE_ARRAY, 2},

Review Comment:
   This will test Arrow to Parquet, can also you add a test for the other way round in `TestConvertParquetSchema`?



##########
cpp/src/parquet/arrow/test_util.h:
##########
@@ -65,12 +65,44 @@ struct Decimal256WithPrecisionAndScale {
   static constexpr int32_t scale = PRECISION - 1;
 };
 
+inline std::vector<uint16_t> RandomHalfFloatValues(size_t size, uint16_t min,

Review Comment:
   Hmm, this function is quite complicated and also won't generate a uniform distribution of values like `random_real` does, will it?
   How about simply generating random floats with `random_real` and then casting them to `Float16`?



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};

Review Comment:
   Also, Parquet C++ has no big-endian compatibility currently. 



##########
cpp/src/parquet/arrow/arrow_schema_test.cc:
##########
@@ -906,22 +908,6 @@ TEST_F(TestConvertArrowSchema, ArrowFields) {
   // ASSERT_NO_FATAL_FAILURE();
 }
 
-TEST_F(TestConvertArrowSchema, ArrowNonconvertibleFields) {
-  struct FieldConstructionArguments {
-    std::string name;
-    std::shared_ptr<::arrow::DataType> datatype;
-  };
-
-  std::vector<FieldConstructionArguments> cases = {
-      {"float16", ::arrow::float16()},

Review Comment:
   Instead of removing this test, can we use another Arrow datatype whose conversion to Parquet isn't supported?



##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2305,6 +2307,74 @@ struct SerializeFunctor<
   int64_t* scratch;
 };
 
+// ----------------------------------------------------------------------
+// Write Arrow to Float16
+
+// Requires a custom serializer because Float16s in Parquet are stored as a 2-byte
+// (little-endian) FLBA, whereas in Arrow they're a native `uint16_t`. Also, a temporary
+// buffer is needed if there's an endian mismatch.
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext* ctx,
+                   FLBA* out) {
+#if ARROW_LITTLE_ENDIAN

Review Comment:
   None of the Parquet C++ implementation is big-endian compatible, so I'm not sure you should bother specifically here. We also don't have any CI for this.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -527,9 +616,28 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void IncrementNumValues(int64_t n) override { num_values_ += n; }
 
+  static bool IsMeaningfulLogicalType(LogicalType::Type::type type) {
+    switch (type) {
+      case LogicalType::Type::FLOAT16:
+        return true;
+      default:
+        return false;
+    }
+  }
+
   bool Equals(const Statistics& raw_other) const override {
     if (physical_type() != raw_other.physical_type()) return false;
 
+    const auto logical_id = LogicalTypeId(*this);

Review Comment:
   Perhaps store the logical type id in the `TypedStatisticsImpl` constructor, instead of calling `LogicalTypeId` from multiple places?



##########
cpp/src/arrow/util/float16_test.cc:
##########
@@ -0,0 +1,246 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <array>
+#include <cmath>
+#include <utility>
+#include <vector>
+
+#include <gtest/gtest.h>
+
+#include "arrow/testing/gtest_util.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+namespace {
+
+template <typename T>
+using Limits = std::numeric_limits<T>;
+
+float F32(uint32_t bits) { return SafeCopy<float>(bits); }
+
+TEST(Float16Test, RoundTripFromFloat32) {
+  struct TestCase {
+    float f32;
+    uint16_t b16;
+    float f16_as_f32;
+  };
+  // Expected values were also manually validated with numpy-1.24.3
+  const TestCase test_cases[] = {
+      // +/-0.0f
+      {F32(0x80000000u), 0b1000000000000000u, -0.0f},
+      {F32(0x00000000u), 0b0000000000000000u, +0.0f},
+      // 32-bit exp is 102 => 2^-25. Rounding to nearest.
+      {F32(0xb3000001u), 0b1000000000000001u, -5.96046447754e-8f},
+      // 32-bit exp is 102 => 2^-25. Rounding to even.
+      {F32(0xb3000000u), 0b1000000000000000u, -0.0f},
+      // 32-bit exp is 101 => 2^-26. Underflow to zero.
+      {F32(0xb2800001u), 0b1000000000000000u, -0.0f},
+      // 32-bit exp is 108 => 2^-19.
+      {F32(0xb61a0000u), 0b1000000000100110u, -2.26497650146e-6f},
+      // 32-bit exp is 108 => 2^-19.
+      {F32(0xb61e0000u), 0b1000000000101000u, -2.38418579102e-6f},
+      // 32-bit exp is 112 => 2^-15. Rounding to nearest.
+      {F32(0xb87fa001u), 0b1000001111111111u, -6.09755516052e-5f},
+      // 32-bit exp is 112 => 2^-15. Rounds to 16-bit exp of 1 => 2^-14
+      {F32(0xb87fe001u), 0b1000010000000000u, -6.103515625e-5f},
+      // 32-bit exp is 142 => 2^15. Rounding to nearest.
+      {F32(0xc7001001u), 0b1111100000000001u, -32800.0f},
+      // 32-bit exp is 142 => 2^15. Rounding to even.
+      {F32(0xc7001000u), 0b1111100000000000u, -32768.0f},
+      // 65520.0f rounds to inf
+      {F32(0x477ff000u), 0b0111110000000000u, Limits<float>::infinity()},
+      // 65488.0039062f rounds to 65504.0 (float16 max)
+      {F32(0x477fd001u), 0b0111101111111111u, 65504.0f},
+      // 32-bit exp is 127 => 2^0, rounds to 16-bit exp of 16 => 2^1.
+      {F32(0xbffff000u), 0b1100000000000000u, -2.0f},
+  };
+
+  for (size_t index = 0; index < std::size(test_cases); ++index) {
+    ARROW_SCOPED_TRACE("index=", index);
+    const auto& tc = test_cases[index];
+    const auto f16 = Float16::FromFloat(tc.f32);
+    EXPECT_EQ(tc.b16, f16.bits());
+    EXPECT_EQ(tc.f16_as_f32, f16.ToFloat());

Review Comment:
   Similar suggestion in other tests below.



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};

Review Comment:
   memcpy isn't even needed. You can (or should be able to) just reuse the buffers. For example by calling `array.View(type)`.



##########
cpp/src/parquet/arrow/arrow_reader_writer_test.cc:
##########
@@ -916,9 +925,15 @@ TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) {
 }
 
 TYPED_TEST(TestParquetIO, SingleColumnOptionalDictionaryWrite) {
-  // Skip tests for BOOL as we don't create dictionaries for it.
-  if (TypeParam::type_id == ::arrow::Type::BOOL) {
-    return;
+  switch (TypeParam::type_id) {
+    // Skip tests for BOOL as we don't create dictionaries for it.
+    case ::arrow::Type::BOOL:
+    // Skip tests for HALF_FLOAT as it's not currently supported by `dictionary_encode`
+    case ::arrow::Type::HALF_FLOAT:
+      GTEST_SKIP();

Review Comment:
   Also do the same for bool above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1761683259

   I've been experimenting with this in C++. If I pass a `float` variable to the `Append` method of `HalfFloatBuilder`, it silently truncates it to an unsigned short integer value. If I pass a  float literal to `Append`, it truncates it to an unsigned short with a warning:
   ```
    warning: implicit conversion from 'double' to 'value_type' (aka 'unsigned short') changes value from 3.14 to 3
   ```
   
   Can we make the `Append` and `AppendValues` methods of `HalfFloatBuilder` automatically downcast `float` input to half-float?
   
    There's a [Gist here with the example code I'm using](https://gist.github.com/ianmcook/62ad82c17a0bb6cdb4f39d52785c4db9).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1642305311

   > @wgtmac Are you still interested in working on the Java side of this?
   
   I can do that but probably not this month. I am too busy with an internal project of my employer. If it is of high priority and someone else has an interest, I am happy to review it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] anjakefala commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1663073194

   I just pushed a commit that adds PyArrow test coverage for float16 Parquet. We can add [Python] to the PR name. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1302427543


##########
cpp/src/parquet/column_writer.cc:
##########
@@ -2305,6 +2307,74 @@ struct SerializeFunctor<
   int64_t* scratch;
 };
 
+// ----------------------------------------------------------------------
+// Write Arrow to Float16
+
+// Requires a custom serializer because Float16s in Parquet are stored as a 2-byte
+// (little-endian) FLBA, whereas in Arrow they're a native `uint16_t`. Also, a temporary
+// buffer is needed if there's an endian mismatch.
+template <>
+struct SerializeFunctor<::parquet::FLBAType, ::arrow::HalfFloatType> {
+  Status Serialize(const ::arrow::HalfFloatArray& array, ArrowWriteContext* ctx,
+                   FLBA* out) {
+#if ARROW_LITTLE_ENDIAN
+    return SerializeInPlace(array, ctx, out);
+#else
+    return SerializeWithScratch(array, ctx, out);
+#endif
+  }
+
+  Status SerializeInPlace(const ::arrow::HalfFloatArray& array, ArrowWriteContext*,
+                          FLBA* out) {
+    const uint16_t* values = array.raw_values();
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < array.length(); ++i) {
+        out[i] = ToFLBA(&values[i]);
+      }
+    } else {
+      for (int64_t i = 0; i < array.length(); ++i) {
+        out[i] = array.IsValid(i) ? ToFLBA(&values[i]) : FLBA{};
+      }
+    }
+    return Status::OK();
+  }
+
+  Status SerializeWithScratch(const ::arrow::HalfFloatArray& array,
+                              ArrowWriteContext* ctx, FLBA* out) {
+    AllocateScratch(array, ctx);
+    if (array.null_count() == 0) {
+      for (int64_t i = 0; i < array.length(); ++i) {
+        out[i] = ToFLBA(array.Value(i));
+      }
+    } else {
+      for (int64_t i = 0; i < array.length(); ++i) {
+        out[i] = array.IsValid(i) ? ToFLBA(array.Value(i)) : FLBA{};
+      }
+    }
+    return Status::OK();
+  }
+
+ private:
+  FLBA ToFLBA(const uint16_t* value_ptr) const {
+    return FLBA{reinterpret_cast<const uint8_t*>(value_ptr)};
+  }
+  FLBA ToFLBA(uint16_t value) {

Review Comment:
   The code look good to me, but the two mode here is a bit confusing, would different name for these `ToFLBA` better? (avoid mis-use the `ToFLBA` and touch `scratch_` )



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};

Review Comment:
   The code looks good to me, but why can't we blink memcpy when it's `LITTLE_ENDIAN`? Since if it's null, the value is undefined?



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};
+      if (binary_array.IsValid(i)) {
+        const uint8_t* in_ptr = binary_array.GetValue(i);
+        f16 = Float16::FromLittleEndian(in_ptr);
+      }
+      f16.ToBytes(out_ptr);
+    }
+  } else {
+#if ARROW_LITTLE_ENDIAN
+    // No need to byte-swap, so do a simple copy
+    std::memcpy(out_ptr, binary_array.raw_values(), length * byte_width);
+#else
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      const uint8_t* in_ptr = binary_array.GetValue(i);
+      Float16::FromLittleEndian(in_ptr).ToBytes(out_ptr);
+    }
+#endif
+  }
+
+  *out = std::make_shared<::arrow::HalfFloatArray>(
+      type, length, std::move(data), binary_array.null_bitmap(), null_count);
+  return Status::OK();
+}
+
+/// \brief Convert an arrow::BinaryArray to an arrow::HalfFloatArray
+/// We do this by:
+/// 1. Creating an arrow::BinaryArray from the RecordReader's builder
+/// 2. Allocating a buffer for the arrow::HalfFloatArray
+/// 3. Converting the little-endian bytes in each BinaryArray entry to native-endian
+/// halffloat (uint16_t) values
+Status TransferHalfFloat(RecordReader* reader, MemoryPool* pool,
+                         const std::shared_ptr<Field>& field, Datum* out) {
+  auto binary_reader = dynamic_cast<BinaryRecordReader*>(reader);
+  DCHECK(binary_reader);
+  ::arrow::ArrayVector chunks = binary_reader->GetBuilderChunks();

Review Comment:
   `BinaryRecordReader` can be regard as a base class, so it would work here. however I think it would be more clear with @bkietz 's advice



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1329233889


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  template <typename T, typename std::enable_if_t<std::is_floating_point_v<T>>* = NULLPTR>
+  explicit Float16(T f) : Float16(FromNative(f)) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+  /// \brief Create a `Float16` from a 64-bit float (may lose precision)
+  static Float16 FromDouble(double d);
+  /// \brief Create a `Float16` from a native floating-point value (may lose precision)
+  static Float16 FromNative(float f) { return FromFloat(f); }
+  static Float16 FromNative(double d) { return FromDouble(d); }
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(::arrow::bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(::arrow::bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is finite and not NaN
+  constexpr bool is_finite() const { return (value_ & 0x7c00) != 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+  /// \brief Convert to a 64-bit float
+  double ToDouble() const;
+
+  explicit operator float() const { return ToFloat(); }
+  explicit operator double() const { return ToDouble(); }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16{::arrow::bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in little-endian byte order
+  constexpr std::array<uint8_t, 2> ToLittleEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#else
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#endif
+  }
+
+  /// \brief Copy the value's bytes in big-endian byte order
+  void ToBigEndian(uint8_t* dest) const {
+    Float16{::arrow::bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in big-endian byte order
+  constexpr std::array<uint8_t, 2> ToBigEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#else
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#endif
+  }
+
+  constexpr Float16 operator-() const { return Float16(value_ ^ 0x8000); }
+  constexpr Float16 operator+() const { return Float16(value_); }
+
+  friend constexpr bool operator==(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16 lhs, Float16 rhs) { return !(lhs == rhs); }
+
+  friend constexpr bool operator<(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16 lhs, Float16 rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16 lhs, Float16 rhs) { return rhs <= lhs; }
+
+  ARROW_FRIEND_EXPORT friend std::ostream& operator<<(std::ostream& os, Float16 arg);
+
+ protected:
+  uint16_t value_;

Review Comment:
   I think it's worth noting that this type is currently `trivial`. That's useful in a number of ways (tends to hint to the compiler that it should be even more forgiving about aliasing, for example) but it does imply objects of this type may have indeterminate value:
   
   ```c++
   int a; // default initialization of trivial -> indeterminate value
   std::cout << a << std::endl; // UB; use of an indeterminate value
   
   int b{}; // explicit zero initialization
   std::cout << b << std::endl; // prints '0'
   
   Float16 fa; // default initialization of trivial -> indeterminate value
   std::cout << fa.bits() << std::endl; // UB; use of an indeterminate value
   
   Float16 fz{}; // explicit aggregate initialization of trivial
                 // -> will zero-initialize `value_`
   std::cout << fz.bits() << std::endl; // prints '0'
   ```
   
   If this is not desired, we could add a default member initializer here
   
   ```suggestion
     uint16_t value_{};
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1672431139

   > If you want, I can provide my work in some form as a starting point.
   
   Yes, it would be good to continue the work based on your local change on the Java side.
   
   > Right now I'm working on conversions between short/half-precision floats for this PR, so if the need arises for similar functionality in parquet-mr then stay tuned.
   
   IMO, the float16 support in the parquet-mr should be simpler than here. At least it does not require to provide a full-featured float16 data type implementation. CMIW, the goal of parquet-mr is to support read/write float16 values (and stats/page index) in the form of `byte[2]`. It is the engine's responsibility to handle conversion from/to any customized float16 type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1720139279

   @zhangjiashen Awesome, thanks! Just let us know if anything changes.
   
   BTW, when you do open the PR, there's already any existing ticket that you can reference [here](https://issues.apache.org/jira/browse/PARQUET-1647) - so no need to create a new one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1761964627

   @benibus thanks! Using `arrow::util::Float16(float_value).bits()` solves it. I updated the example Gist accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1600905393

   Well, if Arrow integration does not come in this PR, then at least some testing of reading/writing using non-Arrow Parquet APIs should still be added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1234322437


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);

Review Comment:
   Are you sure? Just double-checked and native float equivalents return true for me (i.e. `1.0f != std::numeric_limits<float>::quiet_NaN()`, `std::numeric_limits<float>::quiet_NaN() != std::numeric_limits<float>::quiet_NaN()`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1234454918


##########
cpp/src/arrow/util/float16_test.cc:
##########
@@ -0,0 +1,160 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <array>
+#include <utility>
+#include <vector>
+
+#include <gtest/gtest.h>
+
+#include "arrow/testing/gtest_util.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+namespace {
+
+template <typename T>
+using Limits = std::numeric_limits<T>;
+
+// Holds a float16 and its equivalent float32
+struct TestValue {
+  TestValue(Float16 f16, float f32) : f16(f16), f32(f32) {}
+  TestValue(uint16_t u16, float f32) : TestValue(Float16(u16), f32) {}
+
+  Float16 f16;
+  float f32;
+};
+
+#define GENERATE_OPERATOR(NAME, OP)                              \
+  struct NAME {                                                  \
+    std::pair<bool, bool> operator()(TestValue l, TestValue r) { \
+      return std::make_pair((l.f32 OP r.f32), (l.f16 OP r.f16)); \
+    }                                                            \
+  }
+
+GENERATE_OPERATOR(CompareEq, ==);
+GENERATE_OPERATOR(CompareNe, !=);
+GENERATE_OPERATOR(CompareLt, <);
+GENERATE_OPERATOR(CompareGt, >);
+GENERATE_OPERATOR(CompareLe, <=);
+GENERATE_OPERATOR(CompareGe, >=);
+
+#undef GENERATE_OPERATOR
+
+const std::vector<TestValue> g_test_values = {
+    TestValue(Limits<Float16>::min(), +0.00006104f),
+    TestValue(Limits<Float16>::max(), +65504.0f),
+    TestValue(Limits<Float16>::lowest(), -65504.0f),
+    TestValue(+Limits<Float16>::infinity(), +Limits<float>::infinity()),
+    TestValue(-Limits<Float16>::infinity(), -Limits<float>::infinity()),
+    // Multiple (semantically equivalent) NaN representations
+    TestValue(0x7fff, Limits<float>::quiet_NaN()),
+    TestValue(0xffff, Limits<float>::quiet_NaN()),
+    TestValue(0x7e00, Limits<float>::quiet_NaN()),
+    TestValue(0xfe00, Limits<float>::quiet_NaN()),
+    // Positive/negative zeroes
+    TestValue(0x0000, +0.0f),
+    TestValue(0x8000, -0.0f),
+    // Miscellaneous values. In general, they're chosen to test the sign/exponent and
+    // exponent/mantissa boundaries
+    TestValue(0x101c, +0.000502f),
+    TestValue(0x901c, -0.000502f),
+    TestValue(0x101d, +0.0005022f),
+    TestValue(0x901d, -0.0005022f),
+    TestValue(0x121c, +0.000746f),
+    TestValue(0x921c, -0.000746f),
+    TestValue(0x141c, +0.001004f),
+    TestValue(0x941c, -0.001004f),
+    TestValue(0x501c, +32.9f),
+    TestValue(0xd01c, -32.9f),
+    // A few subnormals for good measure
+    TestValue(0x001c, +0.0000017f),
+    TestValue(0x801c, -0.0000017f),
+    TestValue(0x021c, +0.0000332f),
+    TestValue(0x821c, -0.0000332f),
+};
+
+template <typename Operator>
+class Float16OperatorTest : public ::testing::Test {
+ public:
+  void TestCompare(const std::vector<TestValue>& test_values) {
+    const auto num_values = static_cast<int>(test_values.size());
+
+    // Check all combinations of operands in both directions
+    for (int offset = 0; offset < num_values; ++offset) {

Review Comment:
   Haha yeah... had a more convoluted approach in mind initially and didn't reexamine that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1230932433


##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {
+  constexpr static uint16_t min() { return 0b1111101111111111; }
+  constexpr static uint16_t max() { return 0b0111101111111111; }
+  constexpr static uint16_t positive_zero() { return 0b0000000000000000; }
+  constexpr static uint16_t negative_zero() { return 0b1000000000000000; }
+
+  static uint8_t* min_ptr() { return min_; }
+  static uint8_t* max_ptr() { return max_; }
+  static uint8_t* positive_zero_ptr() { return positive_zero_; }
+  static uint8_t* negative_zero_ptr() { return negative_zero_; }
+
+  static bool is_nan(uint16_t n) { return (n & 0x7c00) == 0x7c00 && (n & 0x03ff) != 0; }
+  static bool is_zero(uint16_t n) { return (n & 0x7fff) == 0; }
+  static bool signbit(uint16_t n) { return (n & 0x8000) != 0; }
+
+  static uint16_t Pack(const uint8_t* src) {

Review Comment:
   I'm not sure, but the terminology seems a bit confusing to me. For example, in the Python stdlib [pack](https://docs.python.org/3/library/struct.html#struct.pack) means serialize to bytes and [unpack](https://docs.python.org/3/library/struct.html#struct.unpack) means the reverse...



##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   Do we want to put these helpers in `arrow/util` instead? At some point we'll probably want to process float16 data in Arrow C++ as well...
   



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -907,6 +921,36 @@ void TestStatisticsSortOrder<FLBAType>::SetValues() {
       .set_max(std::string(reinterpret_cast<const char*>(&vals[8][0]), FLBA_LENGTH));
 }
 
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::AddNodes(std::string name) {
+  auto node =
+      schema::PrimitiveNode::Make(name, Repetition::REQUIRED, LogicalType::Float16(),
+                                  Type::FIXED_LEN_BYTE_ARRAY, sizeof(uint16_t));
+  fields_.push_back(std::move(node));
+}
+
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::SetValues() {
+  constexpr int kValueLen = 2;
+  constexpr int kNumBytes = NUM_VALUES * kValueLen;
+
+  const uint16_t packed_vals[NUM_VALUES] = {

Review Comment:
   Add a comment listing the actual float values?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1073,50 +1125,217 @@ void CheckExtrema() {
 TEST(TestStatistic, Int32Extrema) { CheckExtrema<Int32Type>(); }
 TEST(TestStatistic, Int64Extrema) { CheckExtrema<Int64Type>(); }
 
-// PARQUET-1225: Float NaN values may lead to incorrect min-max
-template <typename ParquetType>
-void CheckNaNs() {
-  using T = typename ParquetType::c_type;
+template <typename T>
+class TestFloatStatistics : public ::testing::Test {
+ public:
+  using ParquetType = typename RebindLogical<T>::ParquetType;
+  using c_type = typename ParquetType::c_type;
+
+  void Init();
+  void SetUp() override { this->Init(); }
+
+  bool signbit(c_type val);
+  void CheckEq(const c_type& l, const c_type& r);
+  NodePtr MakeNode(const std::string& name, Repetition::type rep);
+
+  template <typename Stats, typename Values>
+  void CheckMinMaxZeroesSign(Stats stats, const Values& values) {
+    stats->Update(values.data(), values.size(), 0);
+    ASSERT_TRUE(stats->HasMinMax());
+
+    this->CheckEq(stats->min(), positive_zero_);
+    ASSERT_TRUE(this->signbit(stats->min()));
+
+    this->CheckEq(stats->max(), positive_zero_);
+    ASSERT_FALSE(this->signbit(stats->max()));
+  }
+
+  // ARROW-5562: Ensure that -0.0f and 0.0f values are properly handled like in
+  // parquet-mr
+  void TestNegativeZeroes() {
+    NodePtr node = this->MakeNode("f", Repetition::OPTIONAL);
+    ColumnDescriptor descr(node, 1, 1);
+
+    {
+      std::array<c_type, 2> values{negative_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{negative_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+  }
+
+  // PARQUET-1225: Float NaN values may lead to incorrect min-max
+  template <typename Values>
+  void CheckNaNs(ColumnDescriptor* descr, const Values& all_nans, const Values& some_nans,
+                 const Values& other_nans, c_type min, c_type max, uint8_t valid_bitmap,
+                 uint8_t valid_bitmap_no_nans) {
+    auto some_nan_stats = MakeStatistics<ParquetType>(descr);
+    // Ingesting only nans should not yield valid min max
+    AssertUnsetMinMax(some_nan_stats, all_nans);
+    // Ingesting a mix of NaNs and non-NaNs should not yield valid min max.

Review Comment:
   The comment doesn't match the assertion below, does it?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -972,12 +1016,20 @@ TEST_F(TestStatisticsSortOrderFLBA, UnknownSortOrder) {
   ASSERT_FALSE(cc_metadata->is_stats_set());
 }
 
+template <typename T>
+static std::string EncodeValue(const T& val) {
+  return std::string(reinterpret_cast<const char*>(&val), sizeof(val));
+}
+static std::string EncodeValue(const FLBA& val, int length = sizeof(uint16_t)) {
+  return std::string(reinterpret_cast<const char*>(val.ptr), length);
+}
+
 template <typename Stats, typename Array, typename T = typename Array::value_type>
 void AssertMinMaxAre(Stats stats, const Array& values, T expected_min, T expected_max) {
   stats->Update(values.data(), values.size(), 0);
   ASSERT_TRUE(stats->HasMinMax());
-  EXPECT_EQ(stats->min(), expected_min);
-  EXPECT_EQ(stats->max(), expected_max);
+  EXPECT_EQ(stats->EncodeMin(), EncodeValue(expected_min));
+  EXPECT_EQ(stats->EncodeMax(), EncodeValue(expected_max));

Review Comment:
   This is because `FLBA` doesn't have a `operator==`, right?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1073,50 +1125,217 @@ void CheckExtrema() {
 TEST(TestStatistic, Int32Extrema) { CheckExtrema<Int32Type>(); }
 TEST(TestStatistic, Int64Extrema) { CheckExtrema<Int64Type>(); }
 
-// PARQUET-1225: Float NaN values may lead to incorrect min-max
-template <typename ParquetType>
-void CheckNaNs() {
-  using T = typename ParquetType::c_type;
+template <typename T>
+class TestFloatStatistics : public ::testing::Test {
+ public:
+  using ParquetType = typename RebindLogical<T>::ParquetType;
+  using c_type = typename ParquetType::c_type;
+
+  void Init();
+  void SetUp() override { this->Init(); }
+
+  bool signbit(c_type val);
+  void CheckEq(const c_type& l, const c_type& r);
+  NodePtr MakeNode(const std::string& name, Repetition::type rep);
+
+  template <typename Stats, typename Values>
+  void CheckMinMaxZeroesSign(Stats stats, const Values& values) {
+    stats->Update(values.data(), values.size(), 0);
+    ASSERT_TRUE(stats->HasMinMax());
+
+    this->CheckEq(stats->min(), positive_zero_);
+    ASSERT_TRUE(this->signbit(stats->min()));
+
+    this->CheckEq(stats->max(), positive_zero_);
+    ASSERT_FALSE(this->signbit(stats->max()));
+  }
+
+  // ARROW-5562: Ensure that -0.0f and 0.0f values are properly handled like in
+  // parquet-mr
+  void TestNegativeZeroes() {
+    NodePtr node = this->MakeNode("f", Repetition::OPTIONAL);
+    ColumnDescriptor descr(node, 1, 1);
+
+    {
+      std::array<c_type, 2> values{negative_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{negative_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+  }
+
+  // PARQUET-1225: Float NaN values may lead to incorrect min-max
+  template <typename Values>
+  void CheckNaNs(ColumnDescriptor* descr, const Values& all_nans, const Values& some_nans,
+                 const Values& other_nans, c_type min, c_type max, uint8_t valid_bitmap,
+                 uint8_t valid_bitmap_no_nans) {
+    auto some_nan_stats = MakeStatistics<ParquetType>(descr);
+    // Ingesting only nans should not yield valid min max
+    AssertUnsetMinMax(some_nan_stats, all_nans);
+    // Ingesting a mix of NaNs and non-NaNs should not yield valid min max.
+    AssertMinMaxAre(some_nan_stats, some_nans, min, max);
+    // Ingesting only nans after a valid min/max, should have not effect
+    AssertMinMaxAre(some_nan_stats, all_nans, min, max);
+
+    some_nan_stats = MakeStatistics<ParquetType>(descr);
+    AssertUnsetMinMax(some_nan_stats, all_nans, &valid_bitmap);
+    // NaNs should not pollute min max when excluded via null bitmap.
+    AssertMinMaxAre(some_nan_stats, some_nans, &valid_bitmap_no_nans, min, max);
+    // Ingesting NaNs with a null bitmap should not change the result.
+    AssertMinMaxAre(some_nan_stats, some_nans, &valid_bitmap, min, max);
+
+    // An array that doesn't start with NaN
+    auto other_stats = MakeStatistics<ParquetType>(descr);
+    AssertMinMaxAre(other_stats, other_nans, min, max);
+  }
+
+  void TestNaNs();
+
+ protected:
+  std::vector<uint8_t> data_buf_;
+  c_type positive_zero_;
+  c_type negative_zero_;
+};
+
+template <typename T>
+void TestFloatStatistics<T>::Init() {
+  positive_zero_ = c_type{};
+  negative_zero_ = -positive_zero_;
+}
+template <>
+void TestFloatStatistics<Float16LogicalType>::Init() {
+  positive_zero_ = c_type{float16::positive_zero_ptr()};
+  negative_zero_ = c_type{float16::negative_zero_ptr()};
+}
+
+template <typename T>
+NodePtr TestFloatStatistics<T>::MakeNode(const std::string& name, Repetition::type rep) {
+  return PrimitiveNode::Make(name, rep, ParquetType::type_num);
+}
+template <>
+NodePtr TestFloatStatistics<Float16LogicalType>::MakeNode(const std::string& name,
+                                                          Repetition::type rep) {
+  return PrimitiveNode::Make(name, rep, LogicalType::Float16(),
+                             Type::FIXED_LEN_BYTE_ARRAY, 2);
+}
 
+template <typename T>
+void TestFloatStatistics<T>::CheckEq(const c_type& l, const c_type& r) {
+  ASSERT_EQ(l, r);
+}
+template <>
+void TestFloatStatistics<Float16LogicalType>::CheckEq(const c_type& a, const c_type& b) {
+  auto l = float16::Pack(a);
+  auto r = float16::Pack(b);
+  if (float16::is_zero(l) && float16::is_zero(r)) return;
+  ASSERT_EQ(l, r);
+}
+
+template <typename T>
+bool TestFloatStatistics<T>::signbit(c_type val) {
+  return std::signbit(val);
+}
+template <>
+bool TestFloatStatistics<Float16LogicalType>::signbit(c_type val) {
+  return float16::signbit(float16::Pack(val));
+}
+
+template <typename T>
+void TestFloatStatistics<T>::TestNaNs() {
   constexpr int kNumValues = 8;
-  NodePtr node = PrimitiveNode::Make("f", Repetition::OPTIONAL, ParquetType::type_num);
+  NodePtr node = this->MakeNode("f", Repetition::OPTIONAL);
   ColumnDescriptor descr(node, 1, 1);
 
-  constexpr T nan = std::numeric_limits<T>::quiet_NaN();
-  constexpr T min = -4.0f;
-  constexpr T max = 3.0f;
+  constexpr c_type nan = std::numeric_limits<c_type>::quiet_NaN();
+  constexpr c_type min = -4.0f;
+  constexpr c_type max = 3.0f;
+
+  std::array<c_type, kNumValues> all_nans{nan, nan, nan, nan, nan, nan, nan, nan};
+  std::array<c_type, kNumValues> some_nans{nan, max, -3.0f, -1.0f, nan, 2.0f, min, nan};
+  std::array<c_type, kNumValues> other_nans{1.5f, max, -3.0f, -1.0f, nan, 2.0f, min, nan};
 
-  std::array<T, kNumValues> all_nans{nan, nan, nan, nan, nan, nan, nan, nan};
-  std::array<T, kNumValues> some_nans{nan, max, -3.0f, -1.0f, nan, 2.0f, min, nan};
   uint8_t valid_bitmap = 0x7F;  // 0b01111111
   // NaNs excluded
   uint8_t valid_bitmap_no_nans = 0x6E;  // 0b01101110
 
-  // Test values
-  auto some_nan_stats = MakeStatistics<ParquetType>(&descr);
-  // Ingesting only nans should not yield valid min max
-  AssertUnsetMinMax(some_nan_stats, all_nans);
-  // Ingesting a mix of NaNs and non-NaNs should not yield valid min max.
-  AssertMinMaxAre(some_nan_stats, some_nans, min, max);
-  // Ingesting only nans after a valid min/max, should have not effect
-  AssertMinMaxAre(some_nan_stats, all_nans, min, max);
+  this->CheckNaNs(&descr, all_nans, some_nans, other_nans, min, max, valid_bitmap,
+                  valid_bitmap_no_nans);
+}
+
+template <>
+void TestFloatStatistics<Float16LogicalType>::TestNaNs() {
+  constexpr int kNumValues = 8;
+  constexpr int kValueLen = sizeof(uint16_t);
+
+  NodePtr node = this->MakeNode("f", Repetition::OPTIONAL);
+  ColumnDescriptor descr(node, 1, 1);
+
+  const uint16_t nan_int = 0b1111110010101010;
+  const uint16_t min_int = 0b1010010111000110;

Review Comment:
   Can you comment on FP values for non-NaN constants?



##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {
+  constexpr static uint16_t min() { return 0b1111101111111111; }
+  constexpr static uint16_t max() { return 0b0111101111111111; }
+  constexpr static uint16_t positive_zero() { return 0b0000000000000000; }
+  constexpr static uint16_t negative_zero() { return 0b1000000000000000; }
+
+  static uint8_t* min_ptr() { return min_; }
+  static uint8_t* max_ptr() { return max_; }
+  static uint8_t* positive_zero_ptr() { return positive_zero_; }
+  static uint8_t* negative_zero_ptr() { return negative_zero_; }
+
+  static bool is_nan(uint16_t n) { return (n & 0x7c00) == 0x7c00 && (n & 0x03ff) != 0; }
+  static bool is_zero(uint16_t n) { return (n & 0x7fff) == 0; }
+  static bool signbit(uint16_t n) { return (n & 0x8000) != 0; }

Review Comment:
   Why not make all these methods constexpr?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +278,54 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16::max_ptr()}; }
+  static T DefaultMax() { return T{float16::min_ptr()}; }
+
+  static T Coalesce(T val, T fallback) {
+    return val.ptr != nullptr && float16::is_nan(float16::Pack(val)) ? fallback : val;
+  }
+
+  static inline bool Compare(int type_length, const T& a, const T& b) {
+    uint16_t l = float16::Pack(a);
+    uint16_t r = float16::Pack(b);

Review Comment:
   Do we know for sure that neither `l` nor `r` can be NaN here? If so, add `DCHECK`s. Otherwise, false should always be returned if either is NaN (or is it implicit below?).



##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {
+  constexpr static uint16_t min() { return 0b1111101111111111; }
+  constexpr static uint16_t max() { return 0b0111101111111111; }
+  constexpr static uint16_t positive_zero() { return 0b0000000000000000; }
+  constexpr static uint16_t negative_zero() { return 0b1000000000000000; }
+
+  static uint8_t* min_ptr() { return min_; }
+  static uint8_t* max_ptr() { return max_; }
+  static uint8_t* positive_zero_ptr() { return positive_zero_; }
+  static uint8_t* negative_zero_ptr() { return negative_zero_; }
+
+  static bool is_nan(uint16_t n) { return (n & 0x7c00) == 0x7c00 && (n & 0x03ff) != 0; }
+  static bool is_zero(uint16_t n) { return (n & 0x7fff) == 0; }
+  static bool signbit(uint16_t n) { return (n & 0x8000) != 0; }
+
+  static uint16_t Pack(const uint8_t* src) {
+    return ::arrow::bit_util::FromLittleEndian(::arrow::util::SafeLoadAs<uint16_t>(src));
+  }
+  static uint16_t Pack(const FLBA& src) { return Pack(src.ptr); }
+
+  static uint8_t* Unpack(uint16_t src, uint8_t* dest) {
+    src = ::arrow::bit_util::ToLittleEndian(src);
+    return static_cast<uint8_t*>(std::memcpy(dest, &src, sizeof(src)));

Review Comment:
   Is it worth actually returning the destination pointer?



##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +278,54 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16::max_ptr()}; }
+  static T DefaultMax() { return T{float16::min_ptr()}; }
+
+  static T Coalesce(T val, T fallback) {
+    return val.ptr != nullptr && float16::is_nan(float16::Pack(val)) ? fallback : val;
+  }
+
+  static inline bool Compare(int type_length, const T& a, const T& b) {
+    uint16_t l = float16::Pack(a);
+    uint16_t r = float16::Pack(b);
+
+    if (l & 0x8000) {

Review Comment:
   Why not use the helper functions, e.g. `signbit`, instead of resorting to magic bitmasks?



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1073,50 +1125,217 @@ void CheckExtrema() {
 TEST(TestStatistic, Int32Extrema) { CheckExtrema<Int32Type>(); }
 TEST(TestStatistic, Int64Extrema) { CheckExtrema<Int64Type>(); }
 
-// PARQUET-1225: Float NaN values may lead to incorrect min-max
-template <typename ParquetType>
-void CheckNaNs() {
-  using T = typename ParquetType::c_type;
+template <typename T>
+class TestFloatStatistics : public ::testing::Test {
+ public:
+  using ParquetType = typename RebindLogical<T>::ParquetType;
+  using c_type = typename ParquetType::c_type;
+
+  void Init();
+  void SetUp() override { this->Init(); }
+
+  bool signbit(c_type val);
+  void CheckEq(const c_type& l, const c_type& r);
+  NodePtr MakeNode(const std::string& name, Repetition::type rep);
+
+  template <typename Stats, typename Values>
+  void CheckMinMaxZeroesSign(Stats stats, const Values& values) {
+    stats->Update(values.data(), values.size(), 0);
+    ASSERT_TRUE(stats->HasMinMax());
+
+    this->CheckEq(stats->min(), positive_zero_);

Review Comment:
   By the way, this is where you could perhaps check the encoded form (as +/-0 are semantically equal, but their encodings are different).



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1073,50 +1125,217 @@ void CheckExtrema() {
 TEST(TestStatistic, Int32Extrema) { CheckExtrema<Int32Type>(); }
 TEST(TestStatistic, Int64Extrema) { CheckExtrema<Int64Type>(); }
 
-// PARQUET-1225: Float NaN values may lead to incorrect min-max
-template <typename ParquetType>
-void CheckNaNs() {
-  using T = typename ParquetType::c_type;
+template <typename T>
+class TestFloatStatistics : public ::testing::Test {
+ public:
+  using ParquetType = typename RebindLogical<T>::ParquetType;
+  using c_type = typename ParquetType::c_type;
+
+  void Init();
+  void SetUp() override { this->Init(); }
+
+  bool signbit(c_type val);
+  void CheckEq(const c_type& l, const c_type& r);
+  NodePtr MakeNode(const std::string& name, Repetition::type rep);
+
+  template <typename Stats, typename Values>
+  void CheckMinMaxZeroesSign(Stats stats, const Values& values) {
+    stats->Update(values.data(), values.size(), 0);

Review Comment:
   ```suggestion
       stats->Update(values.data(), values.size(), /*null_count=*/ 0);
   ```



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1073,50 +1125,217 @@ void CheckExtrema() {
 TEST(TestStatistic, Int32Extrema) { CheckExtrema<Int32Type>(); }
 TEST(TestStatistic, Int64Extrema) { CheckExtrema<Int64Type>(); }
 
-// PARQUET-1225: Float NaN values may lead to incorrect min-max
-template <typename ParquetType>
-void CheckNaNs() {
-  using T = typename ParquetType::c_type;
+template <typename T>
+class TestFloatStatistics : public ::testing::Test {
+ public:
+  using ParquetType = typename RebindLogical<T>::ParquetType;
+  using c_type = typename ParquetType::c_type;
+
+  void Init();
+  void SetUp() override { this->Init(); }
+
+  bool signbit(c_type val);
+  void CheckEq(const c_type& l, const c_type& r);
+  NodePtr MakeNode(const std::string& name, Repetition::type rep);
+
+  template <typename Stats, typename Values>
+  void CheckMinMaxZeroesSign(Stats stats, const Values& values) {
+    stats->Update(values.data(), values.size(), 0);
+    ASSERT_TRUE(stats->HasMinMax());
+
+    this->CheckEq(stats->min(), positive_zero_);
+    ASSERT_TRUE(this->signbit(stats->min()));
+
+    this->CheckEq(stats->max(), positive_zero_);
+    ASSERT_FALSE(this->signbit(stats->max()));
+  }
+
+  // ARROW-5562: Ensure that -0.0f and 0.0f values are properly handled like in
+  // parquet-mr
+  void TestNegativeZeroes() {
+    NodePtr node = this->MakeNode("f", Repetition::OPTIONAL);
+    ColumnDescriptor descr(node, 1, 1);
+
+    {
+      std::array<c_type, 2> values{negative_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{negative_zero_, negative_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+
+    {
+      std::array<c_type, 2> values{positive_zero_, positive_zero_};
+      auto stats = MakeStatistics<ParquetType>(&descr);
+      CheckMinMaxZeroesSign(stats, values);
+    }
+  }
+
+  // PARQUET-1225: Float NaN values may lead to incorrect min-max
+  template <typename Values>
+  void CheckNaNs(ColumnDescriptor* descr, const Values& all_nans, const Values& some_nans,
+                 const Values& other_nans, c_type min, c_type max, uint8_t valid_bitmap,
+                 uint8_t valid_bitmap_no_nans) {
+    auto some_nan_stats = MakeStatistics<ParquetType>(descr);
+    // Ingesting only nans should not yield valid min max
+    AssertUnsetMinMax(some_nan_stats, all_nans);
+    // Ingesting a mix of NaNs and non-NaNs should not yield valid min max.
+    AssertMinMaxAre(some_nan_stats, some_nans, min, max);
+    // Ingesting only nans after a valid min/max, should have not effect
+    AssertMinMaxAre(some_nan_stats, all_nans, min, max);
+
+    some_nan_stats = MakeStatistics<ParquetType>(descr);
+    AssertUnsetMinMax(some_nan_stats, all_nans, &valid_bitmap);
+    // NaNs should not pollute min max when excluded via null bitmap.
+    AssertMinMaxAre(some_nan_stats, some_nans, &valid_bitmap_no_nans, min, max);
+    // Ingesting NaNs with a null bitmap should not change the result.
+    AssertMinMaxAre(some_nan_stats, some_nans, &valid_bitmap, min, max);
+
+    // An array that doesn't start with NaN
+    auto other_stats = MakeStatistics<ParquetType>(descr);
+    AssertMinMaxAre(other_stats, other_nans, min, max);
+  }
+
+  void TestNaNs();
+
+ protected:
+  std::vector<uint8_t> data_buf_;
+  c_type positive_zero_;
+  c_type negative_zero_;
+};
+
+template <typename T>
+void TestFloatStatistics<T>::Init() {
+  positive_zero_ = c_type{};
+  negative_zero_ = -positive_zero_;
+}
+template <>
+void TestFloatStatistics<Float16LogicalType>::Init() {
+  positive_zero_ = c_type{float16::positive_zero_ptr()};
+  negative_zero_ = c_type{float16::negative_zero_ptr()};
+}
+
+template <typename T>
+NodePtr TestFloatStatistics<T>::MakeNode(const std::string& name, Repetition::type rep) {
+  return PrimitiveNode::Make(name, rep, ParquetType::type_num);
+}
+template <>
+NodePtr TestFloatStatistics<Float16LogicalType>::MakeNode(const std::string& name,
+                                                          Repetition::type rep) {
+  return PrimitiveNode::Make(name, rep, LogicalType::Float16(),
+                             Type::FIXED_LEN_BYTE_ARRAY, 2);
+}
 
+template <typename T>
+void TestFloatStatistics<T>::CheckEq(const c_type& l, const c_type& r) {
+  ASSERT_EQ(l, r);
+}
+template <>
+void TestFloatStatistics<Float16LogicalType>::CheckEq(const c_type& a, const c_type& b) {
+  auto l = float16::Pack(a);
+  auto r = float16::Pack(b);
+  if (float16::is_zero(l) && float16::is_zero(r)) return;
+  ASSERT_EQ(l, r);

Review Comment:
   Do we care about NaNs here?



##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   We might even want to define a `arrow::Float16` struct for easier handling of half-precision floats (together with a bunch of operators, and perhaps overrides for `std::isnan`, `std::numeric_limits` and friends)?
   
   For example:
   ```c++
   struct Float16 {
     uint16_t value;
   
     static Float16 FromBytes(const uint8_t*) { ... }
     static Float16 FromBytes(std::array<uint8_t, 2>) { ... }
     std::array<uint8_t, 2> ToBytes() const { ... }
   
     friend bool operator== ...
   };
   
   static_assert(std::is_trivial_v<Float16>);
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1592129270

   cc @anjakefala @zeroshade @pitrou 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1289446891


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1082,6 +1115,45 @@ void TestStatisticsSortOrder<FLBAType>::SetValues() {
       .set_max(std::string(reinterpret_cast<const char*>(&vals[8][0]), FLBA_LENGTH));
 }
 
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::AddNodes(std::string name) {
+  auto node =
+      schema::PrimitiveNode::Make(name, Repetition::REQUIRED, LogicalType::Float16(),
+                                  Type::FIXED_LEN_BYTE_ARRAY, sizeof(uint16_t));
+  fields_.push_back(std::move(node));
+}
+
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::SetValues() {
+  constexpr int kValueLen = 2;
+  constexpr int kNumBytes = NUM_VALUES * kValueLen;
+
+  const uint16_t u16_vals[NUM_VALUES] = {
+      0b1100010100000000,  // -5.0
+      0b1100010000000000,  // -4.0
+      0b1100001000000000,  // -3.0
+      0b1100000000000000,  // -2.0
+      0b1011110000000000,  // -1.0
+      0b0000000000000000,  // +0.0
+      0b0011110000000000,  // +1.0
+      0b0100000000000000,  // +2.0
+      0b0100001000000000,  // +3.0
+      0b0100010000000000,  // +4.0
+  };
+
+  values_buf_.resize(kNumBytes);
+  uint8_t* ptr = values_buf_.data();
+  for (int i = 0; i < NUM_VALUES; ++i) {
+    Float16(u16_vals[i]).ToLittleEndian(ptr);
+    values_[i].ptr = ptr;
+    ptr += kValueLen;
+  }
+
+  stats_[0]
+      .set_min(std::string(reinterpret_cast<const char*>(values_[0].ptr), kValueLen))

Review Comment:
   Thanks for confirmation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1392928419


##########
cpp/src/parquet/arrow/test_util.h:
##########
@@ -65,12 +66,30 @@ struct Decimal256WithPrecisionAndScale {
   static constexpr int32_t scale = PRECISION - 1;
 };
 
+inline void RandomHalfFloatValues(int64_t n, uint32_t seed,

Review Comment:
   Given this is not Arrow-specific, putting it in `parquet/test_util.cc` would be fine (should probably also move the declaration then).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1761743359

   @ianmcook If you're on this branch, then this best way to do this (for now) is to use [arrow::util::Float16](https://github.com/apache/arrow/blob/b1f5448b1439cfa8f6845ebee6e9d429ef8a1436/cpp/src/arrow/util/float16.h#L42) to downcast to `uint16_t` before calling `Append`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zhangjiashen commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "zhangjiashen (via GitHub)" <gi...@apache.org>.
zhangjiashen commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1720082285

   > @zhangjiashen @wgtmac Just an update, but the Java implementation has recently risen in priority at my company so it's looking like I'll need to continue the work on my branch and (hopefully) have a PR up in the next week or two. However, let me know if there's already been significant work done. If so, I'd be happy to credit/co-author the changes. Regardless, you're welcome to contribute once the PR is up!
   
   @benibus I am working on Java implementation by continuing your changes and will create a PR once it is ready by end of this week


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1673864035

   @wgtmac Hopefully this branch can be of some use: https://github.com/benibus/parquet-mr/tree/PARQUET-1647-float16. It should cover most of the initial boilerplate, at least.
   
   For generating the thrift definitions, I installed the [pending parquet-format branch](https://github.com/apache/parquet-format/pull/184) to my local Maven repo and edited the pom.xml to target the SNAPSHOT version. There may be a better way of doing that kind of thing, of course...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1674123136

   > For generating the thrift definitions, I installed the [pending parquet-format branch](https://github.com/apache/parquet-format/pull/184) to my local Maven repo and edited the pom.xml to target the SNAPSHOT version. There may be a better way of doing that kind of thing, of course...
   
   Thank you very much! I did the same thing to install the unmerged thrift PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1600982105

   We probably want a Java implementation as well according to https://github.com/apache/parquet-format/pull/184#issuecomment-1428363226
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1231081383


##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   I was going to make this same suggestion, so +1 from me on putting these helpers into `arrow/util`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1233085244


##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +278,54 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16::max_ptr()}; }
+  static T DefaultMax() { return T{float16::min_ptr()}; }

Review Comment:
   Thanks for the explanation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1597424673

   By the way:
   
   > In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved.
   
   Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zhangjiashen commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "zhangjiashen (via GitHub)" <gi...@apache.org>.
zhangjiashen commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1722441679

   > @zhangjiashen Awesome, thanks! Just let us know if anything changes.
   > 
   > BTW, when you do open the PR, there's already any existing ticket that you can reference [here](https://issues.apache.org/jira/browse/PARQUET-1647) - so no need to create a new one.
   
   @wgtmac  @benibus @anjakefala , created a [PR](https://github.com/apache/parquet-mr/pull/1142) for some changes, please help take a look when you get a chance


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1812666696

   After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b55d13c16eb25f3264645e53cc03aa1f7d753b25.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/18709345043) has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1392808624


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  explicit Float16(float f) : Float16(FromFloat(f)) {}
+  explicit Float16(double d) : Float16(FromDouble(d)) {}
+  template <typename T,
+            typename std::enable_if_t<std::is_convertible_v<T, double>>* = NULLPTR>
+  explicit Float16(T v) : Float16(static_cast<double>(v)) {}
+
+  /// \brief Create a `Float16` from its exact binary representation
+  constexpr static Float16 FromBits(uint16_t bits) { return Float16{bits, bool{}}; }
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+  /// \brief Create a `Float16` from a 64-bit float (may lose precision)
+  static Float16 FromDouble(double d);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return FromBits(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return FromBits(::arrow::bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return FromBits(::arrow::bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's binary representation as a `uint16_t`
+  constexpr uint16_t bits() const { return bits_; }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (bits_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const { return (bits_ & 0x7fff) > 0x7c00; }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (bits_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is finite and not NaN
+  constexpr bool is_finite() const { return (bits_ & 0x7c00) != 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (bits_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+  /// \brief Convert to a 64-bit float
+  double ToDouble() const;
+
+  explicit operator float() const { return ToFloat(); }
+  explicit operator double() const { return ToDouble(); }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &bits_, sizeof(bits_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    FromBits(::arrow::bit_util::ToLittleEndian(bits_)).ToBytes(dest);

Review Comment:
   This implementation is weird, why call `FromBits` here?
   ```suggestion
       const auto bytes = ToLittleEndian();
       std::memcpy(dest, bytes.data(), bytes.size());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1392926120


##########
cpp/src/parquet/arrow/test_util.h:
##########
@@ -65,12 +66,30 @@ struct Decimal256WithPrecisionAndScale {
   static constexpr int32_t scale = PRECISION - 1;
 };
 
+inline void RandomHalfFloatValues(int64_t n, uint32_t seed,

Review Comment:
   We could either create a corresponding `parquet/arrow/test_util.cc` or put it in `parquet/test_util.cc`. I'm not sure if the former is worth doing in this case, though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1234323885


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);

Review Comment:
   Uh. It appears you're right. I must have been misremembering the (wacky) rules...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1603331324

   @wgtmac If you're willing to do that then it'd be greatly appreciated. I'll try to help where I can (or at least get some more eyes on it).
   
   > Well, if Arrow integration does not come in this PR, then at least some testing of reading/writing using non-Arrow Parquet APIs should still be added.
   
   I'm planning to add Arrow support to this PR in the coming days, but I can add non-Arrow read/write tests if we need more coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1231158742


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -972,12 +1016,20 @@ TEST_F(TestStatisticsSortOrderFLBA, UnknownSortOrder) {
   ASSERT_FALSE(cc_metadata->is_stats_set());
 }
 
+template <typename T>
+static std::string EncodeValue(const T& val) {
+  return std::string(reinterpret_cast<const char*>(&val), sizeof(val));
+}
+static std::string EncodeValue(const FLBA& val, int length = sizeof(uint16_t)) {
+  return std::string(reinterpret_cast<const char*>(val.ptr), length);
+}
+
 template <typename Stats, typename Array, typename T = typename Array::value_type>
 void AssertMinMaxAre(Stats stats, const Array& values, T expected_min, T expected_max) {
   stats->Update(values.data(), values.size(), 0);
   ASSERT_TRUE(stats->HasMinMax());
-  EXPECT_EQ(stats->min(), expected_min);
-  EXPECT_EQ(stats->max(), expected_max);
+  EXPECT_EQ(stats->EncodeMin(), EncodeValue(expected_min));
+  EXPECT_EQ(stats->EncodeMax(), EncodeValue(expected_max));

Review Comment:
   Yeah basically. The testing header does define one but it expects a predetermined length (which is not 2).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1286326570


##########
cpp/src/parquet/statistics.cc:
##########
@@ -527,9 +616,28 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void IncrementNumValues(int64_t n) override { num_values_ += n; }
 
+  static bool IsMeaningfulLogicalType(LogicalType::Type::type type) {

Review Comment:
   This just happens to be first logical type that dramatically affects how the physical value (a `FLBA` in this case) should be interpreted for comparisons. But I tried to generalize things in case any similar logical types are added in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1288809202


##########
cpp/src/parquet/statistics.cc:
##########
@@ -412,9 +492,9 @@ TypedComparatorImpl</*is_signed=*/false, Int32Type>::GetMinMax(const int32_t* va
   return {SafeCopy<int32_t>(min), SafeCopy<int32_t>(max)};
 }
 
-template <bool is_signed, typename DType>
+template <bool is_signed, typename DType, typename Helper>
 std::pair<typename DType::c_type, typename DType::c_type>
-TypedComparatorImpl<is_signed, DType>::GetMinMax(const ::arrow::Array& values) {
+TypedComparatorImpl<is_signed, DType, Helper>::GetMinMax(const ::arrow::Array& values) {

Review Comment:
   We probably need to check if ColumnIndex is generated as expected. Especially its boundary order is different from raw FLBA type.



##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1082,6 +1115,45 @@ void TestStatisticsSortOrder<FLBAType>::SetValues() {
       .set_max(std::string(reinterpret_cast<const char*>(&vals[8][0]), FLBA_LENGTH));
 }
 
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::AddNodes(std::string name) {
+  auto node =
+      schema::PrimitiveNode::Make(name, Repetition::REQUIRED, LogicalType::Float16(),
+                                  Type::FIXED_LEN_BYTE_ARRAY, sizeof(uint16_t));
+  fields_.push_back(std::move(node));
+}
+
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::SetValues() {
+  constexpr int kValueLen = 2;
+  constexpr int kNumBytes = NUM_VALUES * kValueLen;
+
+  const uint16_t u16_vals[NUM_VALUES] = {
+      0b1100010100000000,  // -5.0
+      0b1100010000000000,  // -4.0
+      0b1100001000000000,  // -3.0
+      0b1100000000000000,  // -2.0
+      0b1011110000000000,  // -1.0
+      0b0000000000000000,  // +0.0
+      0b0011110000000000,  // +1.0
+      0b0100000000000000,  // +2.0
+      0b0100001000000000,  // +3.0
+      0b0100010000000000,  // +4.0
+  };
+
+  values_buf_.resize(kNumBytes);
+  uint8_t* ptr = values_buf_.data();
+  for (int i = 0; i < NUM_VALUES; ++i) {
+    Float16(u16_vals[i]).ToLittleEndian(ptr);
+    values_[i].ptr = ptr;
+    ptr += kValueLen;
+  }
+
+  stats_[0]
+      .set_min(std::string(reinterpret_cast<const char*>(values_[0].ptr), kValueLen))

Review Comment:
   QQ: float16 stats simply encodes values in raw FLBA (w/ length 2) in little endian?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1286310533


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,198 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.

Review Comment:
   Honestly, the only reason for the base class was to make testing on the Parquet side more convenient, since the value data needs to be stored externally (see `BufferedFloat16` [here](https://github.com/apache/arrow/blob/2f06b0b0577f53547da9bc471d69c49749ae8e89/cpp/src/parquet/statistics_test.cc#L66)). I'm not sure if this is a good enough justification for a public API though - and I agree that a single class would be better, so I'll try a different approach for the tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1642291228

   @wgtmac Are you still interested in working on the Java side of this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1392808890


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  explicit Float16(float f) : Float16(FromFloat(f)) {}
+  explicit Float16(double d) : Float16(FromDouble(d)) {}
+  template <typename T,
+            typename std::enable_if_t<std::is_convertible_v<T, double>>* = NULLPTR>
+  explicit Float16(T v) : Float16(static_cast<double>(v)) {}
+
+  /// \brief Create a `Float16` from its exact binary representation
+  constexpr static Float16 FromBits(uint16_t bits) { return Float16{bits, bool{}}; }
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+  /// \brief Create a `Float16` from a 64-bit float (may lose precision)
+  static Float16 FromDouble(double d);
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return FromBits(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return FromBits(::arrow::bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return FromBits(::arrow::bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's binary representation as a `uint16_t`
+  constexpr uint16_t bits() const { return bits_; }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (bits_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const { return (bits_ & 0x7fff) > 0x7c00; }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (bits_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is finite and not NaN
+  constexpr bool is_finite() const { return (bits_ & 0x7c00) != 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (bits_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+  /// \brief Convert to a 64-bit float
+  double ToDouble() const;
+
+  explicit operator float() const { return ToFloat(); }
+  explicit operator double() const { return ToDouble(); }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &bits_, sizeof(bits_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    FromBits(::arrow::bit_util::ToLittleEndian(bits_)).ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in little-endian byte order
+  constexpr std::array<uint8_t, 2> ToLittleEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(bits_ & 0xff), uint8_t(bits_ >> 8)};
+#else
+    return {uint8_t(bits_ >> 8), uint8_t(bits_ & 0xff)};
+#endif
+  }
+
+  /// \brief Copy the value's bytes in big-endian byte order
+  void ToBigEndian(uint8_t* dest) const {
+    FromBits(::arrow::bit_util::ToBigEndian(bits_)).ToBytes(dest);

Review Comment:
   Same here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1812188618

   Also we're going to release this in parquet-2.10
   
   https://github.com/apache/parquet-format/pull/219


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1287267354


##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +296,42 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {

Review Comment:
   Is there a reason this can't be another specialization of `CompareHelper`? seems like it'd save some complexity below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1671712574

   @benibus There's an update on my side. @zhangjiashen is interested in working on this in parquet-mr. I will help him make some progress.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1642366402

   No worries! I'll ping some others and keep you updated if necessary.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1593273285

   Thanks! Before moving forward on some of the suggestions, I just want to ensure that I interpreted the endianness requirement from the proposal correctly. As per [the spec](https://github.com/apache/parquet-format/pull/184): 
   
   > The `FLOAT16` annotation represents half-precision floating-point numbers in the 2-byte IEEE little-endian format.
   
   Is this in line with what I've done here (i.e. the `FLBA` itself representing a little-endian binary float) or does it imply something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1601147229

   > We probably want a Java implementation as well according to [apache/parquet-format#184 (comment)](https://github.com/apache/parquet-format/pull/184#issuecomment-1428363226)
   
   OK, if that is a must. I can spare some time to do this. Would be good after this PR gets completed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1234187461


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);
+  }
+
+  friend constexpr bool operator<(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16Base lhs, Float16Base rhs) { return rhs < lhs; }

Review Comment:
   Nit, but it's a bit weird to have this one delegate to its counterpart, while `operator>=` doesn't.



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {

Review Comment:
   Converting to big-endian is probably not useful? (though harmless as well)



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);

Review Comment:
   If either value is NaN, we should return false but this will return true.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -53,6 +55,25 @@ namespace {
 constexpr int value_length(int value_length, const ByteArray& value) { return value.len; }
 constexpr int value_length(int type_length, const FLBA& value) { return type_length; }
 
+// Static "constants" for normalizing float16 min/max values. These need to be expressed
+// as pointers because `Float16LogicalType` represents an FLBA.
+const uint8_t* float16_lowest() {
+  static const auto bytes = std::numeric_limits<Float16>::lowest().ToLittleEndian();

Review Comment:
   This could probably made `constexpr` by spelling out the `#if ARROW_LITTLE_ENDIAN` dance explicitly in `ToLittleEndian` and friends.
   Not sure that's useful here, but worth remembering.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +298,42 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16_max()}; }
+  static T DefaultMax() { return T{float16_lowest()}; }
+
+  static T Coalesce(T val, T fallback) {
+    return val.ptr != nullptr && Float16::FromLittleEndian(val.ptr).is_nan() ? fallback
+                                                                             : val;

Review Comment:
   Shouldn't this be, rather (also adding parentheses for clarity):
   ```suggestion
       return (val.ptr == nullptr || Float16::FromLittleEndian(val.ptr).is_nan()) ? fallback
                                                                                  : val;
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +298,42 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16_max()}; }
+  static T DefaultMax() { return T{float16_lowest()}; }
+
+  static T Coalesce(T val, T fallback) {
+    return val.ptr != nullptr && Float16::FromLittleEndian(val.ptr).is_nan() ? fallback
+                                                                             : val;

Review Comment:
   That said, I don't think `Coalesce` can be called with a null FLBA, which should simplify this.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -458,6 +540,16 @@ std::pair<ByteArray, ByteArray> TypedComparatorImpl<false, ByteArrayType>::GetMi
   return GetMinMaxBinaryHelper<false>(*this, values);
 }
 
+static LogicalType::Type::type LogicalTypeId(const ColumnDescriptor* descr) {

Review Comment:
   Not sure the `static` is required (aren't we already in the anonymouns namespace?).



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);
+  }
+
+  friend constexpr bool operator<(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16Base lhs, Float16Base rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(lhs, rhs);
+  }
+
+  friend std::ostream& operator<<(std::ostream& os, Float16Base arg) {
+    return (os << arg.bits());
+  }
+
+ protected:
+  uint16_t value_;
+
+ private:
+  // Comparison helpers that assume neither operand is NaN
+  static constexpr bool CompareEq(Float16Base lhs, Float16Base rhs) {
+    return (lhs.bits() == rhs.bits()) || (lhs.is_zero() && rhs.is_zero());
+  }
+  static constexpr bool CompareLt(Float16Base lhs, Float16Base rhs) {
+    if (lhs.signbit()) {
+      if (rhs.signbit()) {
+        // Both are negative
+        return (lhs.bits() & 0x7fff) > (rhs.bits() & 0x7fff);
+      } else {
+        // Handle +/-0
+        return !lhs.is_zero() || rhs.bits() != 0;
+      }
+    } else if (rhs.signbit()) {
+      return false;
+    } else {
+      // Both are positive
+      return (lhs.bits() & 0x7fff) < (rhs.bits() & 0x7fff);

Review Comment:
   If they're both positive, then why do we need to AND the bits?



##########
cpp/src/arrow/util/float16_test.cc:
##########
@@ -0,0 +1,160 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <array>
+#include <utility>
+#include <vector>
+
+#include <gtest/gtest.h>
+
+#include "arrow/testing/gtest_util.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+namespace {
+
+template <typename T>
+using Limits = std::numeric_limits<T>;
+
+// Holds a float16 and its equivalent float32
+struct TestValue {
+  TestValue(Float16 f16, float f32) : f16(f16), f32(f32) {}
+  TestValue(uint16_t u16, float f32) : TestValue(Float16(u16), f32) {}
+
+  Float16 f16;
+  float f32;
+};
+
+#define GENERATE_OPERATOR(NAME, OP)                              \
+  struct NAME {                                                  \
+    std::pair<bool, bool> operator()(TestValue l, TestValue r) { \
+      return std::make_pair((l.f32 OP r.f32), (l.f16 OP r.f16)); \
+    }                                                            \
+  }
+
+GENERATE_OPERATOR(CompareEq, ==);
+GENERATE_OPERATOR(CompareNe, !=);
+GENERATE_OPERATOR(CompareLt, <);
+GENERATE_OPERATOR(CompareGt, >);
+GENERATE_OPERATOR(CompareLe, <=);
+GENERATE_OPERATOR(CompareGe, >=);
+
+#undef GENERATE_OPERATOR
+
+const std::vector<TestValue> g_test_values = {
+    TestValue(Limits<Float16>::min(), +0.00006104f),
+    TestValue(Limits<Float16>::max(), +65504.0f),
+    TestValue(Limits<Float16>::lowest(), -65504.0f),
+    TestValue(+Limits<Float16>::infinity(), +Limits<float>::infinity()),
+    TestValue(-Limits<Float16>::infinity(), -Limits<float>::infinity()),
+    // Multiple (semantically equivalent) NaN representations
+    TestValue(0x7fff, Limits<float>::quiet_NaN()),
+    TestValue(0xffff, Limits<float>::quiet_NaN()),
+    TestValue(0x7e00, Limits<float>::quiet_NaN()),
+    TestValue(0xfe00, Limits<float>::quiet_NaN()),
+    // Positive/negative zeroes
+    TestValue(0x0000, +0.0f),
+    TestValue(0x8000, -0.0f),
+    // Miscellaneous values. In general, they're chosen to test the sign/exponent and
+    // exponent/mantissa boundaries
+    TestValue(0x101c, +0.000502f),
+    TestValue(0x901c, -0.000502f),
+    TestValue(0x101d, +0.0005022f),
+    TestValue(0x901d, -0.0005022f),
+    TestValue(0x121c, +0.000746f),
+    TestValue(0x921c, -0.000746f),
+    TestValue(0x141c, +0.001004f),
+    TestValue(0x941c, -0.001004f),
+    TestValue(0x501c, +32.9f),
+    TestValue(0xd01c, -32.9f),
+    // A few subnormals for good measure
+    TestValue(0x001c, +0.0000017f),
+    TestValue(0x801c, -0.0000017f),
+    TestValue(0x021c, +0.0000332f),
+    TestValue(0x821c, -0.0000332f),
+};
+
+template <typename Operator>
+class Float16OperatorTest : public ::testing::Test {
+ public:
+  void TestCompare(const std::vector<TestValue>& test_values) {
+    const auto num_values = static_cast<int>(test_values.size());
+
+    // Check all combinations of operands in both directions
+    for (int offset = 0; offset < num_values; ++offset) {

Review Comment:
   This is a strange way to write the double loop. Why not simply:
   ```c++
   for (int i = 0; i < num_values; ++i) {
     for (int j = 0; i < num_values; ++j) {
   ```



##########
cpp/src/parquet/statistics.cc:
##########
@@ -525,6 +616,19 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
   bool Equals(const Statistics& raw_other) const override {
     if (physical_type() != raw_other.physical_type()) return false;
 
+    const auto logical_id = LogicalTypeId(*this);
+    switch (logical_id) {
+      // Only compare against logical types that influence the interpretation of the
+      // physical type
+      case LogicalType::Type::FLOAT16:
+        if (LogicalTypeId(raw_other) != logical_id) {
+          return false;

Review Comment:
   What if `this` is not a float16 but `raw_other` is?



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);
+  }
+
+  friend constexpr bool operator<(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16Base lhs, Float16Base rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(lhs, rhs);
+  }
+
+  friend std::ostream& operator<<(std::ostream& os, Float16Base arg) {
+    return (os << arg.bits());
+  }
+
+ protected:
+  uint16_t value_;
+
+ private:
+  // Comparison helpers that assume neither operand is NaN
+  static constexpr bool CompareEq(Float16Base lhs, Float16Base rhs) {
+    return (lhs.bits() == rhs.bits()) || (lhs.is_zero() && rhs.is_zero());
+  }
+  static constexpr bool CompareLt(Float16Base lhs, Float16Base rhs) {
+    if (lhs.signbit()) {
+      if (rhs.signbit()) {
+        // Both are negative
+        return (lhs.bits() & 0x7fff) > (rhs.bits() & 0x7fff);

Review Comment:
   I don't think ANDing changes anything here?



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToLittleEndian() const {
+    return Float16Base{bit_util::ToLittleEndian(value_)}.ToBytes();
+  }
+
+  void ToBigEndian(uint8_t* dest) const {
+    Float16Base{bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  std::array<uint8_t, 2> ToBigEndian() const {
+    return Float16Base{bit_util::ToBigEndian(value_)}.ToBytes();
+  }
+
+  friend constexpr bool operator==(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16Base lhs, Float16Base rhs) {
+    return !(lhs == rhs);
+  }
+
+  friend constexpr bool operator<(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16Base::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16Base lhs, Float16Base rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16Base lhs, Float16Base rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16Base::CompareLt(lhs, rhs);
+  }
+
+  friend std::ostream& operator<<(std::ostream& os, Float16Base arg) {

Review Comment:
   This means we should include `<iosfwd>` (and by not defining the method in this header, we avoid including the entire iostream library).



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,179 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.
+/// Such functionality is delegated to subclasses.
+class Float16Base {
+ public:
+  Float16Base() = default;
+  constexpr explicit Float16Base(uint16_t value) : value_(value) {}
+
+  constexpr uint16_t bits() const { return value_; }
+  constexpr explicit operator uint16_t() const { return bits(); }
+
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  std::array<uint8_t, 2> ToBytes() const {
+    std::array<uint8_t, 2> bytes;
+    ToBytes(bytes.data());
+    return bytes;
+  }
+
+  void ToLittleEndian(uint8_t* dest) const {

Review Comment:
   It would be nice to add docstrings for all public methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1719975893

   @zhangjiashen @wgtmac Just an update, but the Java implementation has recently risen in priority at my company so it's looking like I'll need to continue the work on my branch and (hopefully) have a PR up in the next week or two. However, let me know if there's already been significant work done. If so, I'd be happy to credit/co-author the changes. Regardless, you're welcome to contribute once the PR is up!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] anjakefala commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1696123874

   @wgtmac @zhangjiashen for the parquet-mr implementation, could you let me know when you have a branch/PR available on GitHub? =) I published the original request to add float16 to Parquet, and would love to keep track of the implementation work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1696658220

   @zhangjiashen Do you have a rough ETA? Or it would be good to open an PR with the current progress so we can better keep track of it. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1809202663

   Now that the proposal is merged, would anyone be willing to take a final look and potentially merge this? I've gone ahead and rebased for good measure.
   
   I should also mention that the [Go implementation](https://github.com/apache/arrow/pull/37599) was merged today, but it'll eventually need this PR's `parquet.thrift` to re-generate its files in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1392822349


##########
cpp/src/parquet/arrow/test_util.h:
##########
@@ -65,12 +66,30 @@ struct Decimal256WithPrecisionAndScale {
   static constexpr int32_t scale = PRECISION - 1;
 };
 
+inline void RandomHalfFloatValues(int64_t n, uint32_t seed,

Review Comment:
   We don't need to line the definition here, can we put in a `.cc` file instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1812162950

   LGTM, I'm waiting for a last CI check and will merge afterwards.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1283641666


##########
cpp/src/arrow/util/float16.cc:
##########
@@ -0,0 +1,28 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <ostream>
+
+#include "arrow/util/float16.h"
+
+namespace arrow {
+namespace util {
+
+std::ostream& operator<<(std::ostream& os, Float16Base arg) { return (os << arg.bits()); }

Review Comment:
   I think a missing piece from this PR is conversion to/from `float`. With that defined, we could print the equivalent `float` value instead and test that each `float16_test_cc::TestValue` is internally consistent



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,
+                                        const std::shared_ptr<DataType>& type,
+                                        MemoryPool* pool, std::shared_ptr<Array>* out) {
+  constexpr int32_t byte_width = sizeof(uint16_t);
+  DCHECK_EQ(checked_cast<const ::arrow::HalfFloatType&>(*type).byte_width(), byte_width);
+
+  // We read the halffloat (uint16_t) bytes from a raw binary array, in which they're
+  // assumed to be little-endian.
+  const auto& binary_array = checked_cast<const ::arrow::FixedSizeBinaryArray&>(array);
+  DCHECK_EQ(checked_cast<const ::arrow::FixedSizeBinaryType&>(*binary_array.type())
+                .byte_width(),
+            byte_width);
+
+  // Number of elements in the halffloat array
+  const int64_t length = binary_array.length();
+  // Allocate data for the output halffloat array
+  ARROW_ASSIGN_OR_RAISE(auto data, ::arrow::AllocateBuffer(length * byte_width, pool));
+  uint8_t* out_ptr = data->mutable_data();
+
+  const int64_t null_count = binary_array.null_count();
+  // Copy the values to the output array in native-endian format
+  if (null_count > 0) {
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      Float16 f16{0};
+      if (binary_array.IsValid(i)) {
+        const uint8_t* in_ptr = binary_array.GetValue(i);
+        f16 = Float16::FromLittleEndian(in_ptr);
+      }
+      f16.ToBytes(out_ptr);
+    }
+  } else {
+#if ARROW_LITTLE_ENDIAN
+    // No need to byte-swap, so do a simple copy
+    std::memcpy(out_ptr, binary_array.raw_values(), length * byte_width);
+#else
+    for (int64_t i = 0; i < length; ++i, out_ptr += byte_width) {
+      const uint8_t* in_ptr = binary_array.GetValue(i);
+      Float16::FromLittleEndian(in_ptr).ToBytes(out_ptr);
+    }
+#endif
+  }
+
+  *out = std::make_shared<::arrow::HalfFloatArray>(
+      type, length, std::move(data), binary_array.null_bitmap(), null_count);
+  return Status::OK();
+}
+
+/// \brief Convert an arrow::BinaryArray to an arrow::HalfFloatArray
+/// We do this by:
+/// 1. Creating an arrow::BinaryArray from the RecordReader's builder
+/// 2. Allocating a buffer for the arrow::HalfFloatArray
+/// 3. Converting the little-endian bytes in each BinaryArray entry to native-endian
+/// halffloat (uint16_t) values
+Status TransferHalfFloat(RecordReader* reader, MemoryPool* pool,
+                         const std::shared_ptr<Field>& field, Datum* out) {
+  auto binary_reader = dynamic_cast<BinaryRecordReader*>(reader);
+  DCHECK(binary_reader);
+  ::arrow::ArrayVector chunks = binary_reader->GetBuilderChunks();

Review Comment:
   Maybe I'm missing something, but I think this should be specifically a fixed size binary array:
   ```suggestion
   /// 1. Creating an arrow::FixedSizeBinaryArray from the RecordReader's builder
   /// 2. Allocating a buffer for the arrow::HalfFloatArray
   /// 3. Converting the little-endian bytes in each FixedSizeBinaryArray entry to native-endian
   /// halffloat (uint16_t) values
   Status TransferHalfFloat(RecordReader* reader, MemoryPool* pool,
                            const std::shared_ptr<Field>& field, Datum* out) {
     auto* binary_reader = dynamic_cast<FLBARecordReader*>(reader);
     DCHECK(binary_reader);
     ::arrow::ArrayVector chunks = binary_reader->GetBuilderChunks();
   ```



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,198 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.

Review Comment:
   ```suggestion
   /// NOTE: Methods in the class should not mutate the underlying value or produce copies.
   ```



##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,198 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Base class for an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from MSB to LSB):
+/// - bit 0:     sign
+/// - bits 1-5:  exponent
+/// - bits 6-15: mantissa
+///
+/// NOTE: Methods in the class should not mutate the unerlying value or produce copies.

Review Comment:
   I'm not sure why this division is beneficial. Is this implementation inspired by another which makes this distinction? If possible, I think it'd be preferable to be simpler and have a single class.
   
   If you're looking at BasicDecimal/Decimal, those classes are kept separate so that Gandiva can include them in generated LLVM IR (so methods which reference std:: types and functions are kept out). Similar support for float16 is out of scope here, I would think.



##########
cpp/src/parquet/arrow/reader_internal.cc:
##########
@@ -713,6 +715,77 @@ Status TransferDecimal(RecordReader* reader, MemoryPool* pool,
   return Status::OK();
 }
 
+static inline Status ConvertToHalfFloat(const Array& array,

Review Comment:
   Could this return Result<std::shared_ptr<Array>> instead?



##########
cpp/src/arrow/util/float16_test.cc:
##########
@@ -0,0 +1,168 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <array>
+#include <utility>
+#include <vector>
+
+#include <gtest/gtest.h>
+
+#include "arrow/testing/gtest_util.h"
+#include "arrow/util/endian.h"
+#include "arrow/util/float16.h"
+#include "arrow/util/ubsan.h"
+
+namespace arrow {
+namespace util {
+namespace {
+
+template <typename T>
+using Limits = std::numeric_limits<T>;
+
+// Holds a float16 and its equivalent float32
+struct TestValue {
+  TestValue(Float16 f16, float f32) : f16(f16), f32(f32) {}
+  TestValue(uint16_t u16, float f32) : TestValue(Float16(u16), f32) {}
+
+  Float16 f16;
+  float f32;
+};
+
+#define GENERATE_OPERATOR(NAME, OP)                              \
+  struct NAME {                                                  \
+    std::pair<bool, bool> operator()(TestValue l, TestValue r) { \
+      return std::make_pair((l.f32 OP r.f32), (l.f16 OP r.f16)); \
+    }                                                            \
+  }
+
+GENERATE_OPERATOR(CompareEq, ==);
+GENERATE_OPERATOR(CompareNe, !=);
+GENERATE_OPERATOR(CompareLt, <);
+GENERATE_OPERATOR(CompareGt, >);
+GENERATE_OPERATOR(CompareLe, <=);
+GENERATE_OPERATOR(CompareGe, >=);
+
+#undef GENERATE_OPERATOR
+
+const std::vector<TestValue> g_test_values = {
+    TestValue(Limits<Float16>::min(), +0.00006104f),
+    TestValue(Limits<Float16>::max(), +65504.0f),
+    TestValue(Limits<Float16>::lowest(), -65504.0f),
+    TestValue(+Limits<Float16>::infinity(), +Limits<float>::infinity()),
+    TestValue(-Limits<Float16>::infinity(), -Limits<float>::infinity()),
+    // Multiple (semantically equivalent) NaN representations
+    TestValue(0x7fff, Limits<float>::quiet_NaN()),
+    TestValue(0xffff, Limits<float>::quiet_NaN()),
+    TestValue(0x7e00, Limits<float>::quiet_NaN()),
+    TestValue(0xfe00, Limits<float>::quiet_NaN()),
+    // Positive/negative zeroes
+    TestValue(0x0000, +0.0f),
+    TestValue(0x8000, -0.0f),
+    // Miscellaneous values. In general, they're chosen to test the sign/exponent and
+    // exponent/mantissa boundaries
+    TestValue(0x101c, +0.000502f),
+    TestValue(0x901c, -0.000502f),
+    TestValue(0x101d, +0.0005022f),
+    TestValue(0x901d, -0.0005022f),
+    TestValue(0x121c, +0.000746f),
+    TestValue(0x921c, -0.000746f),
+    TestValue(0x141c, +0.001004f),
+    TestValue(0x941c, -0.001004f),
+    TestValue(0x501c, +32.9f),
+    TestValue(0xd01c, -32.9f),
+    // A few subnormals for good measure
+    TestValue(0x001c, +0.0000017f),
+    TestValue(0x801c, -0.0000017f),
+    TestValue(0x021c, +0.0000332f),
+    TestValue(0x821c, -0.0000332f),
+};
+
+template <typename Operator>
+class Float16OperatorTest : public ::testing::Test {
+ public:
+  void TestCompare(const std::vector<TestValue>& test_values) {
+    const auto num_values = static_cast<int>(test_values.size());
+
+    // Check all combinations of operands in both directions
+    for (int i = 0; i < num_values; ++i) {
+      for (int j = 0; j < num_values; ++j) {
+        ARROW_SCOPED_TRACE(i, ",", j);
+
+        auto a = test_values[i];
+        auto b = test_values[j];
+
+        // Results for float16 and float32 should be the same
+        auto ret = Operator{}(a, b);
+        ASSERT_EQ(ret.first, ret.second);
+      }
+    }
+  }
+};
+
+using OperatorTypes =
+    ::testing::Types<CompareEq, CompareNe, CompareLt, CompareGt, CompareLe, CompareGe>;
+
+TYPED_TEST_SUITE(Float16OperatorTest, OperatorTypes);
+
+TYPED_TEST(Float16OperatorTest, Compare) { this->TestCompare(g_test_values); }
+

Review Comment:
   Instead of going through TYPED_TEST etc, I think it'd be more readable to use:
   
   ```c++
   TEST(Float16Test, Compare) {
     auto ExpectOp = [](std::string op_name, auto op) {
       ARROW_SCOPED_TRACE(op_name);
       const auto num_values = static_cast<int>(g_test_values.size());
   
       // Check all combinations of operands in both directions
       for (int i = 0; i < num_values; ++i) {
         for (int j = 0; j < num_values; ++j) {
           ARROW_SCOPED_TRACE(i, ",", j);
   
           auto [a16, a32] = test_values[i];
           auto [b16, b32] = test_values[j];
   
           // Results for float16 and float32 should be the same
           ASSERT_EQ(op(a16, b16), op(a32, b32));
         }
       }
     };
   
     ExpectOp("equal", [](auto l, auto r) { return l == r; });
     // ...
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1290844142


##########
cpp/src/parquet/statistics.cc:
##########
@@ -527,9 +616,28 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
 
   void IncrementNumValues(int64_t n) override { num_values_ += n; }
 
+  static bool IsMeaningfulLogicalType(LogicalType::Type::type type) {

Review Comment:
   Thanks for you explanation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1289036678


##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -1082,6 +1115,45 @@ void TestStatisticsSortOrder<FLBAType>::SetValues() {
       .set_max(std::string(reinterpret_cast<const char*>(&vals[8][0]), FLBA_LENGTH));
 }
 
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::AddNodes(std::string name) {
+  auto node =
+      schema::PrimitiveNode::Make(name, Repetition::REQUIRED, LogicalType::Float16(),
+                                  Type::FIXED_LEN_BYTE_ARRAY, sizeof(uint16_t));
+  fields_.push_back(std::move(node));
+}
+
+template <>
+void TestStatisticsSortOrder<Float16LogicalType>::SetValues() {
+  constexpr int kValueLen = 2;
+  constexpr int kNumBytes = NUM_VALUES * kValueLen;
+
+  const uint16_t u16_vals[NUM_VALUES] = {
+      0b1100010100000000,  // -5.0
+      0b1100010000000000,  // -4.0
+      0b1100001000000000,  // -3.0
+      0b1100000000000000,  // -2.0
+      0b1011110000000000,  // -1.0
+      0b0000000000000000,  // +0.0
+      0b0011110000000000,  // +1.0
+      0b0100000000000000,  // +2.0
+      0b0100001000000000,  // +3.0
+      0b0100010000000000,  // +4.0
+  };
+
+  values_buf_.resize(kNumBytes);
+  uint8_t* ptr = values_buf_.data();
+  for (int i = 0; i < NUM_VALUES; ++i) {
+    Float16(u16_vals[i]).ToLittleEndian(ptr);
+    values_[i].ptr = ptr;
+    ptr += kValueLen;
+  }
+
+  stats_[0]
+      .set_min(std::string(reinterpret_cast<const char*>(values_[0].ptr), kValueLen))

Review Comment:
   That's correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] anjakefala commented on pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1720114857

   @zhangjiashen Please link to the PR here when it is available! =)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1232024398


##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   Comment or link for float16 specs would be helpful.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -277,11 +278,54 @@ template <bool is_signed>
 struct CompareHelper<FLBAType, is_signed>
     : public BinaryLikeCompareHelperBase<FLBAType, is_signed> {};
 
+struct Float16CompareHelper {
+  using T = FLBA;
+
+  static T DefaultMin() { return T{float16::max_ptr()}; }
+  static T DefaultMax() { return T{float16::min_ptr()}; }

Review Comment:
   ```suggestion
     static T DefaultMin() { return T{float16::min_ptr()}; }
     static T DefaultMax() { return T{float16::max_ptr()}; }
   ```



##########
cpp/src/parquet/types.cc:
##########
@@ -435,6 +435,8 @@ std::shared_ptr<const LogicalType> LogicalType::FromThrift(
     return BSONLogicalType::Make();
   } else if (type.__isset.UUID) {
     return UUIDLogicalType::Make();
+  } else if (type.__isset.FLOAT16) {

Review Comment:
   We may need to change thrift_internal.h for serialization of FLOAT16.



##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {

Review Comment:
   +1. I would expect Float16 to have similar capability of Decimal128/Decimal256.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1597601039

   > Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT.
   
   Yes, my plan _was_ to handle Arrow integration in a subsequent PR (in tandem with the Java implementation), and then move forward with the proposal.
   
   However, I think you're right that we'll need to roll those features into this PR. Just did a closer re-reading of @julienledem's comments on https://github.com/apache/parquet-format/pull/184, and I believe we're going to want the full C++/Java PRs ready (but **not** merged) before approving the proposal - so we'll probably need to sit on this one in the meantime.
   
   That being said, I can maintain this PR until then. Plus we may need to make adjustments if new requirements come in from the Java side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1230337040


##########
cpp/src/parquet/float_internal.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include "arrow/util/bit_util.h"
+#include "arrow/util/ubsan.h"
+#include "parquet/types.h"
+
+namespace parquet {
+
+struct float16 {
+  constexpr static uint16_t min() { return 0b1111101111111111; }
+  constexpr static uint16_t max() { return 0b0111101111111111; }

Review Comment:
   should it follow the `min` `max` lowest` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1600976636

   Just curious: Is it enough to get approval if there is a full C++ implementation only? Or java parity should be ready at the meanwhile?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #36073: GH-36036: [C++][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1631112527

   @pitrou The most recent commits handle Arrow integration if you want to take another look at some point.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1333518630


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}

Review Comment:
   Since the class now supports construction from `float`/`double`, it _seems_ like a bad idea for this to be a public constructor as well (an integer won't implicitly convert to floating point like one might expect) - so I might change some things around. However, LMK if there are any suggestions for the API in general.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #36073: GH-36036: [C++][Python][Parquet] Implement Float16 logical type

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #36073:
URL: https://github.com/apache/arrow/pull/36073#discussion_r1333492541


##########
cpp/src/arrow/util/float16.h:
##########
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <array>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <limits>
+#include <type_traits>
+
+#include "arrow/util/endian.h"
+#include "arrow/util/macros.h"
+#include "arrow/util/ubsan.h"
+#include "arrow/util/visibility.h"
+
+namespace arrow {
+namespace util {
+
+/// \brief Class representing an IEEE half-precision float, encoded as a `uint16_t`
+///
+/// The exact format is as follows (from LSB to MSB):
+/// - bits 0-10:  mantissa
+/// - bits 10-15: exponent
+/// - bit 15:     sign
+///
+class ARROW_EXPORT Float16 {
+ public:
+  Float16() = default;
+  constexpr explicit Float16(uint16_t value) : value_(value) {}
+
+  template <typename T, typename std::enable_if_t<std::is_floating_point_v<T>>* = NULLPTR>
+  explicit Float16(T f) : Float16(FromNative(f)) {}
+
+  /// \brief Create a `Float16` from a 32-bit float (may lose precision)
+  static Float16 FromFloat(float f);
+  /// \brief Create a `Float16` from a 64-bit float (may lose precision)
+  static Float16 FromDouble(double d);
+  /// \brief Create a `Float16` from a native floating-point value (may lose precision)
+  static Float16 FromNative(float f) { return FromFloat(f); }
+  static Float16 FromNative(double d) { return FromDouble(d); }
+
+  /// \brief Read a `Float16` from memory in native-endian byte order
+  static Float16 FromBytes(const uint8_t* src) {
+    return Float16(SafeLoadAs<uint16_t>(src));
+  }
+
+  /// \brief Read a `Float16` from memory in little-endian byte order
+  static Float16 FromLittleEndian(const uint8_t* src) {
+    return Float16(::arrow::bit_util::FromLittleEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Read a `Float16` from memory in big-endian byte order
+  static Float16 FromBigEndian(const uint8_t* src) {
+    return Float16(::arrow::bit_util::FromBigEndian(SafeLoadAs<uint16_t>(src)));
+  }
+
+  /// \brief Return the value's integer representation
+  constexpr uint16_t bits() const { return value_; }
+
+  /// \brief Return true if the value is negative (sign bit is set)
+  constexpr bool signbit() const { return (value_ & 0x8000) != 0; }
+
+  /// \brief Return true if the value is NaN
+  constexpr bool is_nan() const {
+    return (value_ & 0x7c00) == 0x7c00 && (value_ & 0x03ff) != 0;
+  }
+  /// \brief Return true if the value is positive/negative infinity
+  constexpr bool is_infinity() const { return (value_ & 0x7fff) == 0x7c00; }
+  /// \brief Return true if the value is finite and not NaN
+  constexpr bool is_finite() const { return (value_ & 0x7c00) != 0x7c00; }
+  /// \brief Return true if the value is positive/negative zero
+  constexpr bool is_zero() const { return (value_ & 0x7fff) == 0; }
+
+  /// \brief Convert to a 32-bit float
+  float ToFloat() const;
+  /// \brief Convert to a 64-bit float
+  double ToDouble() const;
+
+  explicit operator float() const { return ToFloat(); }
+  explicit operator double() const { return ToDouble(); }
+
+  /// \brief Copy the value's bytes in native-endian byte order
+  void ToBytes(uint8_t* dest) const { std::memcpy(dest, &value_, sizeof(value_)); }
+  /// \brief Return the value's bytes in native-endian byte order
+  constexpr std::array<uint8_t, 2> ToBytes() const {
+#if ARROW_LITTLE_ENDIAN
+    return ToLittleEndian();
+#else
+    return ToBigEndian();
+#endif
+  }
+
+  /// \brief Copy the value's bytes in little-endian byte order
+  void ToLittleEndian(uint8_t* dest) const {
+    Float16{::arrow::bit_util::ToLittleEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in little-endian byte order
+  constexpr std::array<uint8_t, 2> ToLittleEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#else
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#endif
+  }
+
+  /// \brief Copy the value's bytes in big-endian byte order
+  void ToBigEndian(uint8_t* dest) const {
+    Float16{::arrow::bit_util::ToBigEndian(value_)}.ToBytes(dest);
+  }
+  /// \brief Return the value's bytes in big-endian byte order
+  constexpr std::array<uint8_t, 2> ToBigEndian() const {
+#if ARROW_LITTLE_ENDIAN
+    return {uint8_t(value_ >> 8), uint8_t(value_ & 0xff)};
+#else
+    return {uint8_t(value_ & 0xff), uint8_t(value_ >> 8)};
+#endif
+  }
+
+  constexpr Float16 operator-() const { return Float16(value_ ^ 0x8000); }
+  constexpr Float16 operator+() const { return Float16(value_); }
+
+  friend constexpr bool operator==(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareEq(lhs, rhs);
+  }
+  friend constexpr bool operator!=(Float16 lhs, Float16 rhs) { return !(lhs == rhs); }
+
+  friend constexpr bool operator<(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return Float16::CompareLt(lhs, rhs);
+  }
+  friend constexpr bool operator>(Float16 lhs, Float16 rhs) { return rhs < lhs; }
+
+  friend constexpr bool operator<=(Float16 lhs, Float16 rhs) {
+    if (lhs.is_nan() || rhs.is_nan()) return false;
+    return !Float16::CompareLt(rhs, lhs);
+  }
+  friend constexpr bool operator>=(Float16 lhs, Float16 rhs) { return rhs <= lhs; }
+
+  ARROW_FRIEND_EXPORT friend std::ostream& operator<<(std::ostream& os, Float16 arg);
+
+ protected:
+  uint16_t value_;

Review Comment:
   Hmm... I could go either way, but I'm leaning towards keeping it as is - mostly because it mirrors the properties of the native FP types, which seems to be something we're (at least implicitly) aspiring to do here. Plus, if we decide that the compiler/stl optimizations aren't worth the risk, then we could add the default initializer later without as much consequence as the reverse.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #36073:
URL: https://github.com/apache/arrow/pull/36073#issuecomment-1810834181

   > Would it be useful to add tests checking that one write or read a Parquet FLOAT16 with a byte length != 2?
   
   Probably... I'll try to add something, assuming it wouldn't be redundant.
   
   > ( It's a bit weird for me that fp16 can only use PLAIN, RLE_DICT and ... DELTA_BYTE_ARRAY, aha...)
   
   Yeah, I think we're going to specifically address the encodings as a follow-up since there's been some recent discussion in that area.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-36036: [C++][Python][Parquet] Implement Float16 logical type [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #36073:
URL: https://github.com/apache/arrow/pull/36073


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org