You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/06/01 09:48:22 UTC

[GitHub] [arrow] wgtmac commented on a diff in pull request #35825: GH-32723: [C++][Parquet] Add option to use LARGE* variants of binary types

wgtmac commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1212865493


##########
cpp/src/parquet/properties.h:
##########
@@ -116,6 +116,12 @@ class PARQUET_EXPORT ReaderProperties {
     page_checksum_verification_ = check_crc;
   }
 
+  bool use_binary_large_variants() const { return use_binary_large_variants_; }

Review Comment:
   It would be moved to `ArrowReaderProperties` to advise the return type of arrow binary arrays. 
   
   To be clear, `ReaderProperties` controls the behavior the internal parquet reader (without any arrow dependency) while `ArrowReaderProperties` controls the arrow-related behaviors around parquet.



##########
cpp/src/arrow/array/builder_dict.h:
##########
@@ -724,6 +724,7 @@ using BinaryDictionaryBuilder = DictionaryBuilder<BinaryType>;
 using StringDictionaryBuilder = DictionaryBuilder<StringType>;
 using BinaryDictionary32Builder = Dictionary32Builder<BinaryType>;
 using StringDictionary32Builder = Dictionary32Builder<StringType>;
+using LargeBinaryDictionary32Builder = Dictionary32Builder<LargeBinaryType>;

Review Comment:
   Does `LargeStringDictionary32Builder` worth being added as well?



##########
cpp/src/parquet/types.h:
##########
@@ -64,8 +64,12 @@ struct Type {
     DOUBLE = 5,
     BYTE_ARRAY = 6,
     FIXED_LEN_BYTE_ARRAY = 7,
+
+    // This parquet type does not actually exist (AFAIK) and is used to
+    // create proper type traits
+    LARGE_BYTE_ARRAY = 8,

Review Comment:
   I don't think this is right and the parquet internal physical type should be left unchanged.
   
   We should always derive the target arrow binary type from the arrow reader/writer properties.



##########
cpp/src/parquet/arrow/schema_internal.cc:
##########
@@ -200,7 +217,7 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(
     case ParquetType::DOUBLE:
       return ::arrow::float64();
     case ParquetType::BYTE_ARRAY:
-      return FromByteArray(logical_type);
+      return use_binary_large_variant ? FromLargeByteArray(logical_type) : FromByteArray(logical_type);

Review Comment:
   You may directly pass `use_binary_large_variant` to `FromByteArray` and avoid duplicate code.



##########
cpp/src/parquet/types.h:
##########
@@ -588,6 +592,26 @@ inline bool operator!=(const ByteArray& left, const ByteArray& right) {
   return !(left == right);
 }
 
+struct LargeByteArray {
+  LargeByteArray() : len(0), ptr(NULLPTR) {}
+  LargeByteArray(uint64_t len, const uint8_t* ptr) : len(len), ptr(ptr) {}

Review Comment:
   Ditto. 
   
   All the binary values cannot exceed a length of INT32_MAX in the parquet specs, so reusing `ByteArray` is sufficient.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org