You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/08/13 15:23:17 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #1335: Spark: Follow name mapping while importing Parquet tables

aokolnychyi opened a new pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335


   We have to follow name mapping while importing Parquet tables so that metrics are collected correctly. Right now, we may have corrupted metadata if the underlying name mapping does not match ids that would be generated based on positions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#issuecomment-673552161


   Could you take a look, @rdsr @shardulm94 @rdblue @chenjunjiedada? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#issuecomment-674189505


   This one is ready for another review round, @rdblue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470038672



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {

Review comment:
       The idea here is that we should import metrics only for columns in the current schema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470923047



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -86,15 +95,23 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
     Map<Integer, Literal<?>> upperBounds = Maps.newHashMap();
     Set<Integer> missingStats = Sets.newHashSet();
 
-    MessageType parquetType = metadata.getFileMetaData().getSchema();
-    Schema fileSchema = ParquetSchemaUtil.convert(parquetType);
+    // ignore metrics for fields we failed to determine reliable IDs
+    MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
+    Schema fileSchema = ParquetSchemaUtil.convert(parquetTypeWithIds, name -> null);

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#issuecomment-673552161


   Could you take a look, @rdsr @shardulm94 @rdblue @chenjunjiedada @RussellSpitzer ? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470035342



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -43,31 +44,45 @@ public static NameMapping of(MappedFields fields) {
   }
 
   private final MappedFields mapping;
-  private final Map<Integer, MappedField> fieldsById;
-  private final Map<String, MappedField> fieldsByName;
+  private transient Map<Integer, MappedField> fieldsById;
+  private transient Map<String, MappedField> fieldsByName;
 
   NameMapping(MappedFields mapping) {
     this.mapping = mapping;
-    this.fieldsById = MappingUtil.indexById(mapping);
-    this.fieldsByName = MappingUtil.indexByName(mapping);
+    lazyFieldsById();

Review comment:
       Same here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470197459



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       Yeah, we have a number of places like this to fix, including `Schema`.
   
   I've created #1337 to follow up on this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470923002



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -56,36 +60,38 @@
 
   @Override
   public Type message(MessageType message, List<Type> fields) {
-    return struct(message, fields);
+    Type struct = struct(message, fields);
+    return struct != null ? struct : Types.StructType.of(Lists.newArrayList());
   }
 
   @Override
   public Type struct(GroupType struct, List<Type> fieldTypes) {
-    if (struct == root) {
-      nextId = 1; // use the reserved IDs for the root struct
-    }
-
     List<org.apache.parquet.schema.Type> parquetFields = struct.getFields();
-    List<Types.NestedField> fields = Lists.newArrayListWithExpectedSize(fieldTypes.size());
+    List<Types.NestedField> fields = Lists.newArrayList();

Review comment:
       Agree




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470767005



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/BaseMessageTypeToType.java
##########
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Optional;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.types.Types.TimestampType;
+import org.apache.parquet.schema.LogicalTypeAnnotation;
+import org.apache.parquet.schema.PrimitiveType;
+
+abstract class BaseMessageTypeToType extends ParquetTypeVisitor<Type> {

Review comment:
       I've refactored the common logic in `MessageTypeToType` to a base class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470256559



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       What about a version of `MessageTypeToType` that removes columns without IDs rather than assigning new ones? That would be more flexible than failing if unassigned columns are present.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470201424



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       Thanks! I don't think `Schema` needs to be updated, since it only uses Guava classes for `transient` fields that get re-initialized after serialization.
   
   Looks like that applies here as well. The change I was referring to is the list of fields in `MappedFields`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470252426



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       We can also add a flag to `MessageTypeToType` to throw an exception instead of assigning an id.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470244566



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       Yeah, you are right, I overlooked.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470918339



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -86,15 +95,23 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
     Map<Integer, Literal<?>> upperBounds = Maps.newHashMap();
     Set<Integer> missingStats = Sets.newHashSet();
 
-    MessageType parquetType = metadata.getFileMetaData().getSchema();
-    Schema fileSchema = ParquetSchemaUtil.convert(parquetType);
+    // ignore metrics for fields we failed to determine reliable IDs
+    MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
+    Schema fileSchema = ParquetSchemaUtil.convert(parquetTypeWithIds, name -> null);

Review comment:
       What about naming this `convertAndPrune`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470043363



##########
File path: spark2/src/test/java/org/apache/iceberg/spark/source/TestSparkTableUtil.java
##########
@@ -239,18 +238,13 @@ public void testImportWithNameMapping() throws Exception {
     List<String> actual = spark.read().format("iceberg").load(DB_NAME + ".target_table")
         .select("data")
         .sort("data")
-        .filter("data<'c'")
-        .collectAsList()
-        .stream()
-        .map(r -> r.getString(0))
-        .collect(Collectors.toList());
+        .filter("data >= 'b'")

Review comment:
       This would fail before the PR as the metrics were imported in the wrong way. The previous predicate did not trigger this issue, so I changed it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470189838



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       Probably better to switch to using arrays. Some people still use Kryo for task serialization and we don't want that to fail.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470037828



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       We have to prune columns without ids. Otherwise, `ParquetSchemaUtil.convert` will assign fallback ids to columns not present in the name mapping and they will most likely conflict with ids assigned by the name mapping.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470193713



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       Do we need `expectedSchema` to do this?
   
   It seems like it would be easier if we didn't need it. What we want is to use the IDs from the mapping without assigning new IDs for columns that don't have them during conversion. So why not make a version of `convert` that doesn't assign columns? That seems much easier than doing column pruning and it doesn't require a new schema -- which introduced new methods everywhere to pass that new schema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470922931



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -56,36 +60,38 @@
 
   @Override
   public Type message(MessageType message, List<Type> fields) {
-    return struct(message, fields);
+    Type struct = struct(message, fields);
+    return struct != null ? struct : Types.StructType.of(Lists.newArrayList());

Review comment:
       For example, `testEmptyProjection` in `TestReadProjection`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470917853



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -41,8 +43,29 @@ public static MessageType convert(Schema schema, String name) {
     return new TypeToMessageType().convert(schema, name);
   }
 
+  /**
+   * Converts a Parquet schema to an Iceberg schema. Fields without IDs are kept and assigned fallback IDs.
+   *
+   * @param parquetSchema a Parquet schema
+   * @return a matching Iceberg schema for the provided Parquet schema
+   */
   public static Schema convert(MessageType parquetSchema) {
-    MessageTypeToType converter = new MessageTypeToType(parquetSchema);
+    // if the Parquet schema does not contain ids, we assign fallback ids to top-level fields
+    // all remaining fields will get ids >= 1000 to avoid pruning columns without ids
+    MessageType parquetSchemaWithIds = hasIds(parquetSchema) ? parquetSchema : addFallbackIds(parquetSchema);
+    AtomicInteger nextId = new AtomicInteger(1000);
+    return convert(parquetSchemaWithIds, name -> nextId.getAndIncrement());
+  }
+
+  /**
+   * Converts a Parquet schema to an Iceberg schema. Fields without IDs are pruned.
+   *
+   * @param parquetSchema a Parquet schema
+   * @param nameToIdFunc a name to field id mapping function
+   * @return a matching Iceberg schema for the provided Parquet schema
+   */
+  public static Schema convert(MessageType parquetSchema, Function<String[], Integer> nameToIdFunc) {

Review comment:
       Do we want to expose `nameToIdFunc` or do we want to have one that accepts a `NameMapping`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470917979



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -86,15 +95,23 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
     Map<Integer, Literal<?>> upperBounds = Maps.newHashMap();
     Set<Integer> missingStats = Sets.newHashSet();
 
-    MessageType parquetType = metadata.getFileMetaData().getSchema();
-    Schema fileSchema = ParquetSchemaUtil.convert(parquetType);
+    // ignore metrics for fields we failed to determine reliable IDs
+    MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
+    Schema fileSchema = ParquetSchemaUtil.convert(parquetTypeWithIds, name -> null);

Review comment:
       What about `convertWithFallbackIds`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470767691



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToTypeWithoutAssigningIds.java
##########
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.base.Joiner;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type.Repetition;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+/**
+ * A visitor that converts a {@link MessageType} to a {@link Type} in Iceberg without assigning ids.
+ * Columns without ids are pruned.
+ */
+public class MessageTypeToTypeWithoutAssigningIds extends BaseMessageTypeToType {
+  private static final Joiner DOT = Joiner.on(".");
+
+  private final Map<String, Integer> aliasToId = Maps.newHashMap();
+
+  MessageTypeToTypeWithoutAssigningIds() {}
+
+  public Map<String, Integer> getAliases() {
+    return aliasToId;
+  }
+
+  @Override
+  public Type message(MessageType message, List<Type> fields) {

Review comment:
       This is the same in both visitors. Should it be moved into the base class?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470900103



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToTypeWithoutAssigningIds.java
##########
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.base.Joiner;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.types.Type;
+import org.apache.iceberg.types.Types;
+import org.apache.parquet.schema.GroupType;
+import org.apache.parquet.schema.MessageType;
+import org.apache.parquet.schema.Type.Repetition;
+
+import static org.apache.iceberg.types.Types.NestedField.optional;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+/**
+ * A visitor that converts a {@link MessageType} to a {@link Type} in Iceberg without assigning ids.
+ * Columns without ids are pruned.
+ */
+public class MessageTypeToTypeWithoutAssigningIds extends BaseMessageTypeToType {
+  private static final Joiner DOT = Joiner.on(".");
+
+  private final Map<String, Integer> aliasToId = Maps.newHashMap();
+
+  MessageTypeToTypeWithoutAssigningIds() {}
+
+  public Map<String, Integer> getAliases() {
+    return aliasToId;
+  }
+
+  @Override
+  public Type message(MessageType message, List<Type> fields) {

Review comment:
       Updated `MessageTypeToType` instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470917409



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -56,36 +60,38 @@
 
   @Override
   public Type message(MessageType message, List<Type> fields) {
-    return struct(message, fields);
+    Type struct = struct(message, fields);
+    return struct != null ? struct : Types.StructType.of(Lists.newArrayList());

Review comment:
       Tests for empty projections?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470923026



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java
##########
@@ -41,8 +43,29 @@ public static MessageType convert(Schema schema, String name) {
     return new TypeToMessageType().convert(schema, name);
   }
 
+  /**
+   * Converts a Parquet schema to an Iceberg schema. Fields without IDs are kept and assigned fallback IDs.
+   *
+   * @param parquetSchema a Parquet schema
+   * @return a matching Iceberg schema for the provided Parquet schema
+   */
   public static Schema convert(MessageType parquetSchema) {
-    MessageTypeToType converter = new MessageTypeToType(parquetSchema);
+    // if the Parquet schema does not contain ids, we assign fallback ids to top-level fields
+    // all remaining fields will get ids >= 1000 to avoid pruning columns without ids
+    MessageType parquetSchemaWithIds = hasIds(parquetSchema) ? parquetSchema : addFallbackIds(parquetSchema);
+    AtomicInteger nextId = new AtomicInteger(1000);
+    return convert(parquetSchemaWithIds, name -> nextId.getAndIncrement());
+  }
+
+  /**
+   * Converts a Parquet schema to an Iceberg schema. Fields without IDs are pruned.
+   *
+   * @param parquetSchema a Parquet schema
+   * @param nameToIdFunc a name to field id mapping function
+   * @return a matching Iceberg schema for the provided Parquet schema
+   */
+  public static Schema convert(MessageType parquetSchema, Function<String[], Integer> nameToIdFunc) {

Review comment:
       Removed it from the public API.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470913511



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -56,36 +60,38 @@
 
   @Override
   public Type message(MessageType message, List<Type> fields) {
-    return struct(message, fields);
+    Type struct = struct(message, fields);
+    return struct != null ? struct : Types.StructType.of(Lists.newArrayList());

Review comment:
       This part requires attention. Tests for empty projections would fail otherwise.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470034727



##########
File path: core/src/main/java/org/apache/iceberg/mapping/MappedFields.java
##########
@@ -37,21 +38,21 @@ public static MappedFields of(List<MappedField> fields) {
   }
 
   private final List<MappedField> fields;
-  private final Map<String, Integer> nameToId;
-  private final Map<Integer, MappedField> idToField;
+  private transient Map<String, Integer> nameToId;
+  private transient Map<Integer, MappedField> idToField;
 
   private MappedFields(List<MappedField> fields) {
     this.fields = ImmutableList.copyOf(fields);
-    this.nameToId = indexIds(fields);
-    this.idToField = indexFields(fields);
+    lazyNameToId();

Review comment:
       I kept the initialization in place. It triggers validation. Otherwise, some tests fail.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470039388



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       An alternative is to build a custom visitor that would combine logic from `ApplyNameMapping` and `PruneColumns`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470039388



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       An alternative is to build a custom visitor and combine logic from `ApplyNameMapping` and `PruneColumns`. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#issuecomment-673552161


   Could you take a look, @rdsr @shardulm94 @rdblue, @chenjunjiedada? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470917509



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/MessageTypeToType.java
##########
@@ -56,36 +60,38 @@
 
   @Override
   public Type message(MessageType message, List<Type> fields) {
-    return struct(message, fields);
+    Type struct = struct(message, fields);
+    return struct != null ? struct : Types.StructType.of(Lists.newArrayList());
   }
 
   @Override
   public Type struct(GroupType struct, List<Type> fieldTypes) {
-    if (struct == root) {
-      nextId = 1; // use the reserved IDs for the root struct
-    }
-
     List<org.apache.parquet.schema.Type> parquetFields = struct.getFields();
-    List<Types.NestedField> fields = Lists.newArrayListWithExpectedSize(fieldTypes.size());
+    List<Types.NestedField> fields = Lists.newArrayList();

Review comment:
       I think it would be fine to continue to use `withExpectedSize` because the size will actually be smaller.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470201424



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       Thanks! I don't think `Schema` needs to be updated, since it only uses Guava classes for `transient` fields that get re-initialized after serialization.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470918168



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -86,15 +95,23 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
     Map<Integer, Literal<?>> upperBounds = Maps.newHashMap();
     Set<Integer> missingStats = Sets.newHashSet();
 
-    MessageType parquetType = metadata.getFileMetaData().getSchema();
-    Schema fileSchema = ParquetSchemaUtil.convert(parquetType);
+    // ignore metrics for fields we failed to determine reliable IDs
+    MessageType parquetTypeWithIds = getParquetTypeWithIds(metadata, nameMapping);
+    Schema fileSchema = ParquetSchemaUtil.convert(parquetTypeWithIds, name -> null);

Review comment:
       Actually, I guess that isn't really the same thing because pruning will remove any fields with no children. This is probably okay, but it would be nice to have a name for this. What about `convertTopLevel`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#issuecomment-674326047


   +1, I just had minor comments.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470245256



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       Agree, let me update this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470250872



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
##########
@@ -136,6 +154,19 @@ public static Metrics footerMetrics(ParquetMetadata metadata, MetricsConfig metr
         toBufferMap(fileSchema, lowerBounds), toBufferMap(fileSchema, upperBounds));
   }
 
+  private static MessageType getParquetType(ParquetMetadata metadata, Schema expectedSchema, NameMapping nameMapping) {
+    MessageType type = metadata.getFileMetaData().getSchema();
+    if (ParquetSchemaUtil.hasIds(type)) {
+      return expectedSchema != null ? ParquetSchemaUtil.pruneColumns(type, expectedSchema) : type;
+    } else if (nameMapping != null) {
+      MessageType typeWithIds = ParquetSchemaUtil.applyNameMapping(type, nameMapping);

Review comment:
       Actually, I remember one point I forgot. We cannot have a version of `convert` that does not assign ids as we must have an id for every column in Iceberg schema. It seems what we want is a way to prune columns without ids.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #1335: Spark: Follow name mapping while importing Parquet tables

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #1335:
URL: https://github.com/apache/iceberg/pull/1335#discussion_r470036193



##########
File path: core/src/main/java/org/apache/iceberg/mapping/NameMapping.java
##########
@@ -27,7 +28,7 @@
 /**
  * Represents a mapping from external schema names to Iceberg type IDs.
  */
-public class NameMapping {
+public class NameMapping implements Serializable {

Review comment:
       While we are using guava classes, I think it should be ok as Spark uses Java serialization for closures.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org