Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/08 14:41:04 UTC

[GitHub] [arrow] pitrou opened a new pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

pitrou opened a new pull request #8136:
URL: https://github.com/apache/arrow/pull/8136


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486435150



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // Restore original Arrow field information that was serialized as Parquet metadata
 // but that is not necessarily present in the field reconstituted from Parquet data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr<Field> field, const Field& origin_field,
-                             std::shared_ptr<Field>* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
     // Restore time zone, if any
-    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*field->type());
+    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*inferred_type);
     const auto& ts_origin_type = static_cast<const ::arrow::TimestampType&>(*origin_type);
 
     // If the unit is the same and the data is tz-aware, then set the original
     // time zone, since Parquet has no native storage for timezones
     if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" &&
         ts_origin_type.timezone() != "") {
-      field = field->WithType(origin_type);
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-      field->type()->id() != ::arrow::Type::DICTIONARY &&
-      IsDictionaryReadSupported(*field->type())) {
+      inferred_type->id() != ::arrow::Type::DICTIONARY &&
+      IsDictionaryReadSupported(*inferred_type)) {
     const auto& dict_origin_type =
         static_cast<const ::arrow::DictionaryType&>(*origin_type);
-    field = field->WithType(
-        ::arrow::dictionary(::arrow::int32(), field->type(), dict_origin_type.ordered()));
+    inferred->field = inferred->field->WithType(
+        ::arrow::dictionary(::arrow::int32(), inferred_type, dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
+  if (field_metadata != nullptr) {
+    if (inferred->field->metadata()) {
+      // Prefer the metadata keys (like field_id) from the current metadata
+      field_metadata = field_metadata->Merge(*inferred->field->metadata());
+    }
+    inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
     // Restore extension type, if the storage type is as read from Parquet
     const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*field->type())) {
-      field = field->WithType(origin_type);
+    if (ex_type.storage_type()->Equals(*inferred_type)) {
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
 
-  // Restore field metadata
-  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
-  if (field_metadata != nullptr) {
-    if (field->metadata()) {
-      // Prefer the metadata keys (like field_id) from the current metadata
-      field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) {
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
+    auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+    // Apply metadata recursively to storage type

Review comment:
       this seems to conflict with the TODO above.







[GitHub] [arrow] pitrou commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-696867693


   Will merge.





[GitHub] [arrow] emkornfield edited a comment on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield edited a comment on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690411317


   > One gotcha I ran into while using a fixed size list type (nested within another list) was that in this case the extension type wasn't preserved. But I suppose this is because the fixed size list is not yet preserved on parquet roundtrip (read back as a normal list type), and to add back the extension type, the storage type needs to be the same. So in hindsight, this behaviour is as expected (as long as fixed size list is not yet supported).
   
   @jorisvandenbossche Reading fixed size lists is not yet supported.  One use-case that will likely not be supported in the short term is reading parquet files with FixedSizeLists that have null elements.  Do your use-cases with FixedSizeList allow for null lists?  I would also add that writing FixedSizeLists with null elements only works under a very limited set of circumstances today as it is.





[GitHub] [arrow] pitrou commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r487921544



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // Restore original Arrow field information that was serialized as Parquet metadata
 // but that is not necessarily present in the field reconstituted from Parquet data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr<Field> field, const Field& origin_field,
-                             std::shared_ptr<Field>* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
     // Restore time zone, if any
-    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*field->type());
+    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*inferred_type);
     const auto& ts_origin_type = static_cast<const ::arrow::TimestampType&>(*origin_type);
 
     // If the unit is the same and the data is tz-aware, then set the original
     // time zone, since Parquet has no native storage for timezones
     if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" &&
         ts_origin_type.timezone() != "") {
-      field = field->WithType(origin_type);
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-      field->type()->id() != ::arrow::Type::DICTIONARY &&
-      IsDictionaryReadSupported(*field->type())) {
+      inferred_type->id() != ::arrow::Type::DICTIONARY &&
+      IsDictionaryReadSupported(*inferred_type)) {
     const auto& dict_origin_type =
         static_cast<const ::arrow::DictionaryType&>(*origin_type);
-    field = field->WithType(
-        ::arrow::dictionary(::arrow::int32(), field->type(), dict_origin_type.ordered()));
+    inferred->field = inferred->field->WithType(
+        ::arrow::dictionary(::arrow::int32(), inferred_type, dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
+  if (field_metadata != nullptr) {
+    if (inferred->field->metadata()) {
+      // Prefer the metadata keys (like field_id) from the current metadata
+      field_metadata = field_metadata->Merge(*inferred->field->metadata());
+    }
+    inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
     // Restore extension type, if the storage type is as read from Parquet
     const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*field->type())) {
-      field = field->WithType(origin_type);
+    if (ex_type.storage_type()->Equals(*inferred_type)) {
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
 
-  // Restore field metadata
-  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
-  if (field_metadata != nullptr) {
-    if (field->metadata()) {
-      // Prefer the metadata keys (like field_id) from the current metadata
-      field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) {
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
+    auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+    // Apply metadata recursively to storage type

Review comment:
       I should make the TODO more precise then, it is about applying to children types.







[GitHub] [arrow] pitrou closed pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8136:
URL: https://github.com/apache/arrow/pull/8136


   





[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486437260



##########
File path: cpp/src/parquet/arrow/writer.cc
##########
@@ -63,6 +66,10 @@ using arrow::Status;
 using arrow::Table;
 using arrow::TimeUnit;
 
+using ArrowTypeId = arrow::Type;

Review comment:
       nit: this actually makes the code less understandable to me and it seems like it only saves 1 character?  (not a big deal either way though).







[GitHub] [arrow] pitrou commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r485028923



##########
File path: cpp/src/arrow/extension_type.h
##########
@@ -80,6 +81,12 @@ class ARROW_EXPORT ExtensionType : public DataType {
   /// \return the serialized representation
   virtual std::string Serialize() const = 0;
 
+  static std::shared_ptr<Array> WrapArray(const std::shared_ptr<DataType>& ext_type,
+                                          const std::shared_ptr<Array>& storage);
+  static std::shared_ptr<ChunkedArray> WrapArray(
+      const std::shared_ptr<DataType>& ext_type,
+      const std::shared_ptr<ChunkedArray>& storage);

Review comment:
       Right :-)







[GitHub] [arrow] pitrou commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r487920623



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // Restore original Arrow field information that was serialized as Parquet metadata
 // but that is not necessarily present in the field reconstituted from Parquet data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr<Field> field, const Field& origin_field,
-                             std::shared_ptr<Field>* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
     // Restore time zone, if any
-    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*field->type());
+    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*inferred_type);
     const auto& ts_origin_type = static_cast<const ::arrow::TimestampType&>(*origin_type);
 
     // If the unit is the same and the data is tz-aware, then set the original
     // time zone, since Parquet has no native storage for timezones
     if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" &&
         ts_origin_type.timezone() != "") {
-      field = field->WithType(origin_type);
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-      field->type()->id() != ::arrow::Type::DICTIONARY &&
-      IsDictionaryReadSupported(*field->type())) {
+      inferred_type->id() != ::arrow::Type::DICTIONARY &&
+      IsDictionaryReadSupported(*inferred_type)) {
     const auto& dict_origin_type =
         static_cast<const ::arrow::DictionaryType&>(*origin_type);
-    field = field->WithType(
-        ::arrow::dictionary(::arrow::int32(), field->type(), dict_origin_type.ordered()));
+    inferred->field = inferred->field->WithType(
+        ::arrow::dictionary(::arrow::int32(), inferred_type, dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
+  if (field_metadata != nullptr) {
+    if (inferred->field->metadata()) {
+      // Prefer the metadata keys (like field_id) from the current metadata
+      field_metadata = field_metadata->Merge(*inferred->field->metadata());
+    }
+    inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
     // Restore extension type, if the storage type is as read from Parquet
     const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*field->type())) {
-      field = field->WithType(origin_type);
+    if (ex_type.storage_type()->Equals(*inferred_type)) {
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
 
-  // Restore field metadata
-  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
-  if (field_metadata != nullptr) {
-    if (field->metadata()) {
-      // Prefer the metadata keys (like field_id) from the current metadata
-      field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) {
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
+    auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+    // Apply metadata recursively to storage type
+    RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred));
+    inferred->storage_field = inferred->field;
+
+    // Restore extension type, if the storage type is as read from Parquet

Review comment:
       Hmm... right. Would "as inferred from Parquet" be clearer?







[GitHub] [arrow] wesm commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r485027960



##########
File path: cpp/src/arrow/extension_type.h
##########
@@ -80,6 +81,12 @@ class ARROW_EXPORT ExtensionType : public DataType {
   /// \return the serialized representation
   virtual std::string Serialize() const = 0;
 
+  static std::shared_ptr<Array> WrapArray(const std::shared_ptr<DataType>& ext_type,
+                                          const std::shared_ptr<Array>& storage);
+  static std::shared_ptr<ChunkedArray> WrapArray(
+      const std::shared_ptr<DataType>& ext_type,
+      const std::shared_ptr<ChunkedArray>& storage);

Review comment:
       Seems OK to me. May want doxygen comments







[GitHub] [arrow] github-actions[bot] commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-688934268


   https://issues.apache.org/jira/browse/ARROW-9078





[GitHub] [arrow] pitrou commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r484974652



##########
File path: cpp/src/arrow/extension_type.h
##########
@@ -80,6 +81,12 @@ class ARROW_EXPORT ExtensionType : public DataType {
   /// \return the serialized representation
   virtual std::string Serialize() const = 0;
 
+  static std::shared_ptr<Array> WrapArray(const std::shared_ptr<DataType>& ext_type,
+                                          const std::shared_ptr<Array>& storage);
+  static std::shared_ptr<ChunkedArray> WrapArray(
+      const std::shared_ptr<DataType>& ext_type,
+      const std::shared_ptr<ChunkedArray>& storage);

Review comment:
       @wesm Thoughts about this API?







[GitHub] [arrow] jorisvandenbossche commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-691856933


   @emkornfield for my *personal* use case (in context of https://github.com/geopandas/geo-arrow-spec/, see this notebook with some proposed nested types: https://nbviewer.jupyter.org/gist/jorisvandenbossche/00e5c4a54f7b94375ccc6921c07825a0), the inner fixed size list would not allow nulls; only the outermost variable size list would be nullable.





[GitHub] [arrow] emkornfield commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690411317


   > One gotcha I ran into while using a fixed size list type (nested within another list) was that in this case the extension type wasn't preserved. But I suppose this is because the fixed size list is not yet preserved on parquet roundtrip (read back as a normal list type), and to add back the extension type, the storage type needs to be the same. So in hindsight, this behaviour is as expected (as long as fixed size list is not yet supported).
   
   @jorisvandenbossche Reading fixed size lists is not yet supported.  One use-case that will likely not be supported in the short term is reading parquet files with FixedSizeLists that have null elements.  Do your use-cases with FixedSizeList allow for null lists?








[GitHub] [arrow] pitrou commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-696686570


   I rebased and addressed review comments.





[GitHub] [arrow] pitrou commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r487920888



##########
File path: cpp/src/parquet/arrow/writer.cc
##########
@@ -63,6 +66,10 @@ using arrow::Status;
 using arrow::Table;
 using arrow::TimeUnit;
 
+using ArrowTypeId = arrow::Type;

Review comment:
       I can revert :-)







[GitHub] [arrow] emkornfield commented on a change in pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#discussion_r486435699



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -682,48 +689,78 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // Restore original Arrow field information that was serialized as Parquet metadata
 // but that is not necessarily present in the field reconstituted from Parquet data
 // (for example, Parquet timestamp types don't carry timezone information).
-Status ApplyOriginalMetadata(std::shared_ptr<Field> field, const Field& origin_field,
-                             std::shared_ptr<Field>* out) {
+
+Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
   auto origin_type = origin_field.type();
-  if (field->type()->id() == ::arrow::Type::TIMESTAMP) {
+  auto inferred_type = inferred->field->type();
+
+  if (inferred_type->id() == ::arrow::Type::TIMESTAMP) {
     // Restore time zone, if any
-    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*field->type());
+    const auto& ts_type = static_cast<const ::arrow::TimestampType&>(*inferred_type);
     const auto& ts_origin_type = static_cast<const ::arrow::TimestampType&>(*origin_type);
 
     // If the unit is the same and the data is tz-aware, then set the original
     // time zone, since Parquet has no native storage for timezones
     if (ts_type.unit() == ts_origin_type.unit() && ts_type.timezone() == "UTC" &&
         ts_origin_type.timezone() != "") {
-      field = field->WithType(origin_type);
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
+
   if (origin_type->id() == ::arrow::Type::DICTIONARY &&
-      field->type()->id() != ::arrow::Type::DICTIONARY &&
-      IsDictionaryReadSupported(*field->type())) {
+      inferred_type->id() != ::arrow::Type::DICTIONARY &&
+      IsDictionaryReadSupported(*inferred_type)) {
     const auto& dict_origin_type =
         static_cast<const ::arrow::DictionaryType&>(*origin_type);
-    field = field->WithType(
-        ::arrow::dictionary(::arrow::int32(), field->type(), dict_origin_type.ordered()));
+    inferred->field = inferred->field->WithType(
+        ::arrow::dictionary(::arrow::int32(), inferred_type, dict_origin_type.ordered()));
+  }
+
+  // Restore field metadata
+  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
+  if (field_metadata != nullptr) {
+    if (inferred->field->metadata()) {
+      // Prefer the metadata keys (like field_id) from the current metadata
+      field_metadata = field_metadata->Merge(*inferred->field->metadata());
+    }
+    inferred->field = inferred->field->WithMetadata(field_metadata);
   }
 
   if (origin_type->id() == ::arrow::Type::EXTENSION) {
     // Restore extension type, if the storage type is as read from Parquet
     const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*field->type())) {
-      field = field->WithType(origin_type);
+    if (ex_type.storage_type()->Equals(*inferred_type)) {
+      inferred->field = inferred->field->WithType(origin_type);
     }
   }
 
-  // Restore field metadata
-  std::shared_ptr<const KeyValueMetadata> field_metadata = origin_field.metadata();
-  if (field_metadata != nullptr) {
-    if (field->metadata()) {
-      // Prefer the metadata keys (like field_id) from the current metadata
-      field_metadata = field_metadata->Merge(*field->metadata());
+  // TODO Should apply metadata recursively, but for that we need to move metadata
+  // application inside NodeToSchemaField (ARROW-9943)
+
+  return Status::OK();
+}
+
+Status ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred) {
+  auto origin_type = origin_field.type();
+  auto inferred_type = inferred->field->type();
+
+  if (origin_type->id() == ::arrow::Type::EXTENSION) {
+    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
+    auto origin_storage_field = origin_field.WithType(ex_type.storage_type());
+
+    // Apply metadata recursively to storage type
+    RETURN_NOT_OK(ApplyOriginalStorageMetadata(*origin_storage_field, inferred));
+    inferred->storage_field = inferred->field;
+
+    // Restore extension type, if the storage type is as read from Parquet

Review comment:
       isn't everything here read from Parquet (is this in contrast to user provided?)







[GitHub] [arrow] pitrou commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690250651


   @emkornfield Would you like to take a quick look?








[GitHub] [arrow] emkornfield commented on pull request #8136: ARROW-9078: [C++] Parquet read / write extension type with nested storage type

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #8136:
URL: https://github.com/apache/arrow/pull/8136#issuecomment-690369288


   A few minor doc comments; otherwise looks ok to me.

