Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/06/17 11:18:31 UTC

[GitHub] [iceberg] singhpk234 opened a new pull request, #5075: Spark: Add `__metadata_col` metadata in metadata columns when doing Schema Conversion

singhpk234 opened a new pull request, #5075:
URL: https://github.com/apache/iceberg/pull/5075

   ### About the change : 
   
   Presently, schema conversion sets empty `metadata` on every column. For metadata columns we should instead add the `__metadata_col` key to the column metadata, which Spark can use to check whether a column is a metadata column and drop it if required, as it does here:
   https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala#L221-L225
   
   Note: the `MetadataAttribute` extractor uses the above key in the attribute metadata to determine whether an attribute is a metadata attribute.
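
   To make the check concrete, here is a minimal plain-Java sketch of the idea, not Spark's actual implementation. The `Field` class and the map-based metadata are hypothetical stand-ins for Spark's `StructField` and `Metadata`; only the key name `__metadata_col` comes from this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-ins for Spark's field and metadata types; hypothetical, for illustration only.
class MetadataColumnCheck {
  // Key named in the PR title.
  static final String METADATA_COL_ATTR_KEY = "__metadata_col";

  // Hypothetical stand-in for a Spark field: a name plus a metadata map.
  static final class Field {
    final String name;
    final Map<String, Object> metadata;

    Field(String name, Map<String, Object> metadata) {
      this.name = name;
      this.metadata = metadata;
    }
  }

  // Builds the metadata that marks a field as a metadata column.
  static Map<String, Object> metadataColumnMarker() {
    Map<String, Object> metadata = new HashMap<>();
    metadata.put(METADATA_COL_ATTR_KEY, Boolean.TRUE);
    return metadata;
  }

  // A field counts as a metadata column only when the key is present and true.
  static boolean isMetadataColumn(Field field) {
    return Boolean.TRUE.equals(field.metadata.get(METADATA_COL_ATTR_KEY));
  }

  // Returns the names of the fields that remain after metadata columns are dropped.
  static List<String> dataColumnNames(List<Field> fields) {
    List<String> names = new ArrayList<>();
    for (Field field : fields) {
      if (!isMetadataColumn(field)) {
        names.add(field.name);
      }
    }
    return names;
  }
}
```

   With empty metadata on every field, as before this change, `isMetadataColumn` is false for all of them, so nothing can be dropped. That is the gap the marker closes.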
   
   This PR includes : 
   (i) A fix for the above
   (ii) A fix for an existing minor typo in `TestSparkSchemaUtil`
   
   ---- 
   ### Testing Done
   Added a unit test for the same.
   
   cc @rdblue @aokolnychyi @jackye1995 @RussellSpitzer 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #5075: Spark: Add `__metadata_col` metadata in metadata columns when doing Schema Conversion

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #5075:
URL: https://github.com/apache/iceberg/pull/5075#discussion_r906567399


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java:
##########
@@ -59,8 +61,11 @@ public DataType struct(Types.StructType struct, List<DataType> fieldResults) {
     for (int i = 0; i < fields.size(); i += 1) {
       Types.NestedField field = fields.get(i);
       DataType type = fieldResults.get(i);
-      StructField sparkField = StructField.apply(
-          field.name(), type, field.isOptional(), Metadata.empty());
+      Metadata metadata = Metadata.empty();
+      if (MetadataColumns.isMetadataColumn(field.name())) {
+        metadata = new MetadataBuilder().putBoolean(MetadataColumns.METADATA_COL_ATTR_KEY, true).build();
+      }

Review Comment:
   Could this be refactored into a separate method? I think that would be cleaner. And, this should identify metadata columns using column IDs, since they are available in Iceberg schemas.
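
   A minimal sketch of what this suggestion could look like, assuming a helper that takes the field ID. It is not the code that was merged. The class name, the `fieldMetadata` helper, the map-based metadata, and the reserved IDs are all hypothetical placeholders; in the real change the ID lookup would come from Iceberg's `MetadataColumns` and the result would be a Spark `Metadata`.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the refactor; uses plain maps instead of Spark's Metadata.
class FieldMetadataSketch {
  static final String METADATA_COL_ATTR_KEY = "__metadata_col";

  // Placeholder reserved IDs, assumed for illustration only.
  static final Set<Integer> METADATA_FIELD_IDS = new HashSet<>();

  static {
    METADATA_FIELD_IDS.add(Integer.MAX_VALUE - 1);
    METADATA_FIELD_IDS.add(Integer.MAX_VALUE - 2);
  }

  // Identifies metadata columns by field ID rather than by name.
  static boolean isMetadataColumn(int fieldId) {
    return METADATA_FIELD_IDS.contains(fieldId);
  }

  // The extracted helper: the marker for metadata columns, empty metadata otherwise.
  static Map<String, Object> fieldMetadata(int fieldId) {
    Map<String, Object> metadata = new HashMap<>();
    if (isMetadataColumn(fieldId)) {
      metadata.put(METADATA_COL_ATTR_KEY, Boolean.TRUE);
    }
    return metadata;
  }
}
```

   Checking IDs instead of names means a user data column that happens to share a name with a metadata column would not be marked by mistake, and the loop body in `struct` shrinks to a single call.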





[GitHub] [iceberg] singhpk234 commented on a diff in pull request #5075: Spark: Add `__metadata_col` metadata in metadata columns when doing Schema Conversion

Posted by GitBox <gi...@apache.org>.
singhpk234 commented on code in PR #5075:
URL: https://github.com/apache/iceberg/pull/5075#discussion_r906699018


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java:
##########
@@ -59,8 +61,11 @@ public DataType struct(Types.StructType struct, List<DataType> fieldResults) {
     for (int i = 0; i < fields.size(); i += 1) {
       Types.NestedField field = fields.get(i);
       DataType type = fieldResults.get(i);
-      StructField sparkField = StructField.apply(
-          field.name(), type, field.isOptional(), Metadata.empty());
+      Metadata metadata = Metadata.empty();
+      if (MetadataColumns.isMetadataColumn(field.name())) {
+        metadata = new MetadataBuilder().putBoolean(MetadataColumns.METADATA_COL_ATTR_KEY, true).build();
+      }

Review Comment:
   Makes sense. I also made it consistent with the 3.3 review comments.





[GitHub] [iceberg] rdblue merged pull request #5075: Spark: Add `__metadata_col` metadata in metadata columns when doing Schema Conversion

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #5075:
URL: https://github.com/apache/iceberg/pull/5075




[GitHub] [iceberg] rdblue commented on pull request #5075: Spark: Add `__metadata_col` metadata in metadata columns when doing Schema Conversion

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #5075:
URL: https://github.com/apache/iceberg/pull/5075#issuecomment-1168126622

   Thanks, @singhpk234!

