Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 17:06:54 UTC

[GitHub] [arrow] pitrou opened a new pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

pitrou opened a new pull request #8366:
URL: https://github.com/apache/arrow/pull/8366


   This allows roundtripping complex types such as `list<dictionary<utf8>>`, `list<extension>`, etc.
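   For illustration, here is a minimal, hypothetical sketch (not part of the PR) of the kind of roundtrip this enables. It assumes the usual `parquet::arrow::WriteTable`/`OpenFile` entry points, explicitly enables `store_schema()` so the Arrow schema is embedded in the Parquet metadata, and uses an empty table purely to exercise the schema:

   ```cpp
   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <parquet/arrow/reader.h>
   #include <parquet/arrow/writer.h>
   #include <parquet/properties.h>

   arrow::Status RoundtripListOfDictionary() {
     // A nested type whose Arrow-level information used to be lost on read.
     auto type = arrow::list(arrow::dictionary(arrow::int32(), arrow::utf8()));
     auto schema = arrow::schema({arrow::field("tags", type)});
     auto table = arrow::Table::Make(
         schema, {std::make_shared<arrow::ChunkedArray>(arrow::ArrayVector{}, type)});

     // Write to an in-memory Parquet buffer, storing the Arrow schema.
     ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
     auto arrow_props = parquet::ArrowWriterProperties::Builder().store_schema()->build();
     ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
         *table, arrow::default_memory_pool(), sink, /*chunk_size=*/1024,
         parquet::default_writer_properties(), arrow_props));
     ARROW_ASSIGN_OR_RAISE(auto buffer, sink->Finish());

     // Read back; with this change, the dictionary type nested inside the list
     // is restored from the stored Arrow schema.
     std::unique_ptr<parquet::arrow::FileReader> reader;
     ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
         std::make_shared<arrow::io::BufferReader>(buffer),
         arrow::default_memory_pool(), &reader));
     std::shared_ptr<arrow::Table> roundtripped;
     ARROW_RETURN_NOT_OK(reader->ReadTable(&roundtripped));
     return roundtripped->schema()->Equals(*schema)
                ? arrow::Status::OK()
                : arrow::Status::Invalid("schema did not roundtrip");
   }
   ```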





[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500862229



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -725,23 +778,18 @@ Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* infe
       field_metadata = field_metadata->Merge(*inferred->field->metadata());
     }
     inferred->field = inferred->field->WithMetadata(field_metadata);
-  }
-
-  if (origin_type->id() == ::arrow::Type::EXTENSION) {
-    // Restore extension type, if the storage type is as read from Parquet
-    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*inferred_type)) {
-      inferred->field = inferred->field->WithType(origin_type);
-    }
+    modified = true;
   }
 
   // TODO Should apply metadata recursively to children, but for that we need

Review comment:
       Oops, yes.







[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500863193



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -91,7 +91,6 @@ struct PARQUET_EXPORT SchemaField {
   std::shared_ptr<::arrow::Field> field;
   // If field has an extension type, an equivalent field with the storage type,

Review comment:
       Ah, yes, will remove, sorry.
   You may be right about the issue, though I'm not sure how one would filter out a column from an extension's storage type.







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500745029



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -725,23 +778,18 @@ Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* infe
       field_metadata = field_metadata->Merge(*inferred->field->metadata());
     }
     inferred->field = inferred->field->WithMetadata(field_metadata);
-  }
-
-  if (origin_type->id() == ::arrow::Type::EXTENSION) {
-    // Restore extension type, if the storage type is as read from Parquet
-    const auto& ex_type = checked_cast<const ::arrow::ExtensionType&>(*origin_type);
-    if (ex_type.storage_type()->Equals(*inferred_type)) {
-      inferred->field = inferred->field->WithType(origin_type);
-    }
+    modified = true;
   }
 
   // TODO Should apply metadata recursively to children, but for that we need

Review comment:
       remove comment?







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500744423



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,62 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
+    const ArrowType& origin_type, const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return ::arrow::struct_;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return [](FieldVector fields) {
+          DCHECK_EQ(fields.size(), 1);
+          return ::arrow::list(std::move(fields[0]));
+        };
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    if (auto factory = GetNestedFactory(*origin_type, *inferred_type)) {

Review comment:
       nit: auto here is a little bit hard for me to read.







[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r501548523



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -91,7 +91,6 @@ struct PARQUET_EXPORT SchemaField {
   std::shared_ptr<::arrow::Field> field;
   // If field has an extension type, an equivalent field with the storage type,

Review comment:
       It would be nice if you could open a JIRA with a reproducer, or at least guidance as to how this can be reproduced.







[GitHub] [arrow] pitrou commented on pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#issuecomment-704819253


   Will merge if CI green. Thank you for the reviews!





[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500861539



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -688,32 +688,21 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 
 Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
 
-using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
-
-Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
-  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
-  return Status::OK();
-}
-
-Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
-  DCHECK_EQ(new_children.size(), 1);
-  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
-  return Status::OK();
-}
-
-// XXX Perhaps add a DataType::WithFields method?
-NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
-                                             const ArrowType& inferred_type) {
+std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
+    const ArrowType& origin_type, const ArrowType& inferred_type) {
   switch (inferred_type.id()) {
     case ::arrow::Type::STRUCT:
       if (origin_type.id() == ::arrow::Type::STRUCT) {
-        return RewrapStructField;
+        return ::arrow::struct_;
       }
       break;
     case ::arrow::Type::LIST:
       // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
       if (origin_type.id() == ::arrow::Type::LIST) {
-        return RewrapListField;
+        return [](FieldVector fields) {
+          DCHECK_EQ(fields.size(), 1);

Review comment:
       Normally no, because we already checked that `origin_type->num_fields() == inferred_type->num_fields()`.







[GitHub] [arrow] bkietz commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500589253



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,73 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
+
+Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
+  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
+  return Status::OK();
+}
+
+Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
+  DCHECK_EQ(new_children.size(), 1);
+  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
+  return Status::OK();
+}
+
+// XXX Perhaps add a DataType::WithFields method?
+NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
+                                             const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return RewrapStructField;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return RewrapListField;
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    const auto nested_rewrapper = GetNestedFieldRewrapper(*origin_type, *inferred_type);
+    if (nested_rewrapper) {
+      // Apply original metadata recursively to children
+      for (int i = 0; i < inferred_type->num_fields(); ++i) {
+        ARROW_ASSIGN_OR_RAISE(
+            const bool child_modified,
+            ApplyOriginalMetadata(*origin_type->field(i), &inferred->children[i]));
+        modified |= child_modified;
+      }
+      if (modified) {
+        // Recreate this field using the modified child fields
+        ::arrow::FieldVector modified_children(inferred_type->num_fields());
+        for (int i = 0; i < inferred_type->num_fields(); ++i) {
+          modified_children[i] = inferred->children[i].field;
+        }
+        RETURN_NOT_OK(nested_rewrapper(std::move(modified_children), inferred));
+      }
+    }

Review comment:
       Sure, I'll push a commit







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r501446625



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -91,7 +91,6 @@ struct PARQUET_EXPORT SchemaField {
   std::shared_ptr<::arrow::Field> field;
   // If field has an extension type, an equivalent field with the storage type,

Review comment:
    To provide more context: GetReader can return a nullptr for a reader (https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L791), and for struct fields the types can change if a specific field is filtered out (https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L840).







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500745907



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -91,7 +91,6 @@ struct PARQUET_EXPORT SchemaField {
   std::shared_ptr<::arrow::Field> field;
   // If field has an extension type, an equivalent field with the storage type,

Review comment:
    Also, just want to double check that we can always get the raw underlying type from the extension types? One place where this code could cause some issues is if a leaf column is filtered out but the extension type requires all columns. (This doesn't necessarily apply to this code review, as I think this was potentially an already existing bug.)
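
    As a purely hypothetical illustration of the filtering scenario (the function and variable names below are assumptions, not code from this PR): a column-projected read drops leaf columns, and if a dropped leaf belongs to an extension type's storage type, it is unclear how the extension type could be restored from the remaining leaves.

    ```cpp
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/reader.h>

    // Hypothetical sketch: read only one leaf column from a file whose stored
    // Arrow schema contains a nested/extension type spanning several leaves.
    arrow::Status ReadProjected(std::shared_ptr<arrow::io::RandomAccessFile> input) {
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(std::move(input),
                                                   arrow::default_memory_pool(), &reader));
      std::shared_ptr<arrow::Table> table;
      // Leaf column 0 only: sibling leaves required by the original type are
      // filtered out, which is the situation discussed above.
      return reader->ReadTable(/*column_indices=*/{0}, &table);
    }
    ```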







[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500862082



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,62 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
+    const ArrowType& origin_type, const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return ::arrow::struct_;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return [](FieldVector fields) {
+          DCHECK_EQ(fields.size(), 1);
+          return ::arrow::list(std::move(fields[0]));
+        };
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    if (auto factory = GetNestedFactory(*origin_type, *inferred_type)) {

Review comment:
       What would you prefer? Avoid combining `if` and assignment?










[GitHub] [arrow] pitrou commented on pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#issuecomment-704409884


   Also cc @jorisvandenbossche 





[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500568316



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,73 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
+
+Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
+  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
+  return Status::OK();
+}
+
+Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
+  DCHECK_EQ(new_children.size(), 1);
+  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
+  return Status::OK();
+}
+
+// XXX Perhaps add a DataType::WithFields method?
+NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
+                                             const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return RewrapStructField;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return RewrapListField;
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    const auto nested_rewrapper = GetNestedFieldRewrapper(*origin_type, *inferred_type);
+    if (nested_rewrapper) {
+      // Apply original metadata recursively to children
+      for (int i = 0; i < inferred_type->num_fields(); ++i) {
+        ARROW_ASSIGN_OR_RAISE(
+            const bool child_modified,
+            ApplyOriginalMetadata(*origin_type->field(i), &inferred->children[i]));
+        modified |= child_modified;
+      }
+      if (modified) {
+        // Recreate this field using the modified child fields
+        ::arrow::FieldVector modified_children(inferred_type->num_fields());
+        for (int i = 0; i < inferred_type->num_fields(); ++i) {
+          modified_children[i] = inferred->children[i].field;
+        }
+        RETURN_NOT_OK(nested_rewrapper(std::move(modified_children), inferred));
+      }
+    }

Review comment:
       Do you want to try and make this change?







[GitHub] [arrow] bkietz commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500552098



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,73 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
+
+Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
+  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
+  return Status::OK();
+}
+
+Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
+  DCHECK_EQ(new_children.size(), 1);
+  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
+  return Status::OK();
+}
+
+// XXX Perhaps add a DataType::WithFields method?
+NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
+                                             const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return RewrapStructField;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return RewrapListField;
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}

Review comment:
       IMHO it's more natural to provide a factory for data types:
   ```suggestion
   std::function<std::shared_ptr<DataType>(FieldVector)> GetNestedFactory(
       const ArrowType& origin_type, const ArrowType& inferred_type) {
     switch (inferred_type.id()) {
       case ::arrow::Type::STRUCT:
         if (origin_type.id() == ::arrow::Type::STRUCT) {
           return ::arrow::struct_;
         }
         break;
       case ::arrow::Type::LIST:
         // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
         if (origin_type.id() == ::arrow::Type::LIST) {
           return [](FieldVector fields) { return ::arrow::list(std::move(fields[0])); };
         }
         break;
       default:
         break;
     }
     return {};
   }
   ```

##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,73 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
+
+Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
+  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
+  return Status::OK();
+}
+
+Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
+  DCHECK_EQ(new_children.size(), 1);
+  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
+  return Status::OK();
+}
+
+// XXX Perhaps add a DataType::WithFields method?
+NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
+                                             const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return RewrapStructField;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return RewrapListField;
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    const auto nested_rewrapper = GetNestedFieldRewrapper(*origin_type, *inferred_type);
+    if (nested_rewrapper) {
+      // Apply original metadata recursively to children
+      for (int i = 0; i < inferred_type->num_fields(); ++i) {
+        ARROW_ASSIGN_OR_RAISE(
+            const bool child_modified,
+            ApplyOriginalMetadata(*origin_type->field(i), &inferred->children[i]));
+        modified |= child_modified;
+      }
+      if (modified) {
+        // Recreate this field using the modified child fields
+        ::arrow::FieldVector modified_children(inferred_type->num_fields());
+        for (int i = 0; i < inferred_type->num_fields(); ++i) {
+          modified_children[i] = inferred->children[i].field;
+        }
+        RETURN_NOT_OK(nested_rewrapper(std::move(modified_children), inferred));
+      }
+    }

Review comment:
       ```suggestion
       const auto nested_factory = GetNestedFactory(*origin_type, *inferred_type);
       if (nested_factory) {
         // Apply original metadata recursively to children
         for (int i = 0; i < inferred_type->num_fields(); ++i) {
           ARROW_ASSIGN_OR_RAISE(
               const bool child_modified,
               ApplyOriginalMetadata(*origin_type->field(i), &inferred->children[i]));
           modified |= child_modified;
         }
         if (modified) {
           // Recreate this field using the modified child fields
           ::arrow::FieldVector modified_children(inferred_type->num_fields());
           for (int i = 0; i < inferred_type->num_fields(); ++i) {
             modified_children[i] = inferred->children[i].field;
           }
          inferred->field = inferred->field->WithType(nested_factory(std::move(modified_children)));
         }
       }
   ```







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500742385



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -688,32 +688,21 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 
 Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
 
-using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
-
-Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
-  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
-  return Status::OK();
-}
-
-Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
-  DCHECK_EQ(new_children.size(), 1);
-  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
-  return Status::OK();
-}
-
-// XXX Perhaps add a DataType::WithFields method?
-NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
-                                             const ArrowType& inferred_type) {
+std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
+    const ArrowType& origin_type, const ArrowType& inferred_type) {
   switch (inferred_type.id()) {
     case ::arrow::Type::STRUCT:
       if (origin_type.id() == ::arrow::Type::STRUCT) {
-        return RewrapStructField;
+        return ::arrow::struct_;
       }
       break;
     case ::arrow::Type::LIST:
       // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
       if (origin_type.id() == ::arrow::Type::LIST) {
-        return RewrapListField;
+        return [](FieldVector fields) {
+          DCHECK_EQ(fields.size(), 1);

Review comment:
    Can a schema be deserialized that has more than 1 field? I.e., should this return a user-space error instead?
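
    For reference, a hypothetical sketch of what the suggested user-space check could look like; it is not what the PR does, and it would require changing the factory to return `arrow::Result` instead of a bare `DataType`:

    ```cpp
    // Hypothetical alternative to the DCHECK, assuming the factory returned
    // arrow::Result<std::shared_ptr<arrow::DataType>> instead.
    return [](FieldVector fields) -> arrow::Result<std::shared_ptr<arrow::DataType>> {
      if (fields.size() != 1) {
        return arrow::Status::Invalid("LIST type expects exactly one child field, got ",
                                      fields.size());
      }
      return arrow::list(std::move(fields[0]));
    };
    ```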







[GitHub] [arrow] github-actions[bot] commented on pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#issuecomment-704421003


   https://issues.apache.org/jira/browse/ARROW-9943








[GitHub] [arrow] pitrou closed pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #8366:
URL: https://github.com/apache/arrow/pull/8366


   





[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500443163



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,91 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+using NestedFieldRewrapper = std::function<Status(FieldVector, SchemaField*)>;
+
+Status RewrapStructField(FieldVector new_children, SchemaField* inferred) {
+  inferred->field = inferred->field->WithType(::arrow::struct_(std::move(new_children)));
+  return Status::OK();
+}
+
+Status RewrapListField(FieldVector new_children, SchemaField* inferred) {
+  DCHECK_EQ(new_children.size(), 1);
+  inferred->field = inferred->field->WithType(::arrow::list(new_children[0]));
+  return Status::OK();
+}
+
+NestedFieldRewrapper GetNestedFieldRewrapper(const ArrowType& origin_type,
+                                             const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return RewrapStructField;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST

Review comment:
       @emkornfield I think you'll be able to add your changes here.
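
    For reference, a hypothetical sketch of how the TODO could later be extended (not part of this PR); `FIXED_SIZE_LIST` would additionally need the list size from `origin_type`:

    ```cpp
    // Hypothetical extra case in GetNestedFactory, mirroring the LIST branch.
    case ::arrow::Type::LARGE_LIST:
      if (origin_type.id() == ::arrow::Type::LARGE_LIST) {
        return [](FieldVector fields) {
          DCHECK_EQ(fields.size(), 1);
          return ::arrow::large_list(std::move(fields[0]));
        };
      }
      break;
    ```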







[GitHub] [arrow] pitrou commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500875953



##########
File path: cpp/src/parquet/arrow/schema.cc
##########
@@ -689,10 +686,62 @@ Status GetOriginSchema(const std::shared_ptr<const KeyValueMetadata>& metadata,
 // but that is not necessarily present in the field reconstitued from Parquet data
 // (for example, Parquet timestamp types doesn't carry timezone information).
 
-Status ApplyOriginalStorageMetadata(const Field& origin_field, SchemaField* inferred) {
+Result<bool> ApplyOriginalMetadata(const Field& origin_field, SchemaField* inferred);
+
+std::function<std::shared_ptr<::arrow::DataType>(FieldVector)> GetNestedFactory(
+    const ArrowType& origin_type, const ArrowType& inferred_type) {
+  switch (inferred_type.id()) {
+    case ::arrow::Type::STRUCT:
+      if (origin_type.id() == ::arrow::Type::STRUCT) {
+        return ::arrow::struct_;
+      }
+      break;
+    case ::arrow::Type::LIST:
+      // TODO also allow LARGE_LIST and FIXED_SIZE_LIST
+      if (origin_type.id() == ::arrow::Type::LIST) {
+        return [](FieldVector fields) {
+          DCHECK_EQ(fields.size(), 1);
+          return ::arrow::list(std::move(fields[0]));
+        };
+      }
+      break;
+    default:
+      break;
+  }
+  return {};
+}
+
+Result<bool> ApplyOriginalStorageMetadata(const Field& origin_field,
+                                          SchemaField* inferred) {
+  bool modified = false;
+
   auto origin_type = origin_field.type();
   auto inferred_type = inferred->field->type();
 
+  const int num_children = inferred_type->num_fields();
+
+  if (num_children > 0 && origin_type->num_fields() == num_children) {
+    DCHECK_EQ(static_cast<int>(inferred->children.size()), num_children);
+    if (auto factory = GetNestedFactory(*origin_type, *inferred_type)) {

Review comment:
       I've put the assignment on its own line.
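
    Roughly, the refactored form reads as follows (a sketch based on the surrounding diff, not the exact committed code):

    ```cpp
    // The assignment now sits on its own line instead of inside the if condition.
    const auto factory = GetNestedFactory(*origin_type, *inferred_type);
    if (factory) {
      // ... recurse into the children and rebuild the nested type, as in the diff above ...
    }
    ```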







[GitHub] [arrow] emkornfield commented on a change in pull request #8366: ARROW-9943: [C++] Recursively apply Arrow metadata when reading from Parquet

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #8366:
URL: https://github.com/apache/arrow/pull/8366#discussion_r500742727



##########
File path: cpp/src/parquet/arrow/schema.h
##########
@@ -91,7 +91,6 @@ struct PARQUET_EXPORT SchemaField {
   std::shared_ptr<::arrow::Field> field;
   // If field has an extension type, an equivalent field with the storage type,

Review comment:
       remove the docs also?



