You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/04 04:02:12 UTC

[GitHub] [iceberg] zhongyujiang opened a new pull request, #6118: Parquet, Core: Fix collection of Parquet metrics when column names co…

zhongyujiang opened a new pull request, #6118:
URL: https://github.com/apache/iceberg/pull/6118

   …ntain special characters.
   
   Iceberg escape special characters in field names when converting Schema to Parquet MessageType or Avro Schema: 
   https://github.com/apache/iceberg/blob/7a247fdb44110363bd9e87d1ace91497fd72ba92/core/src/main/java/org/apache/iceberg/avro/AvroSchemaUtil.java#L429-L434
   so the MetricsMode obtained by the field name in the file schema might be wrong if the column name is an escaped one. This pr updates MetricsConfig to use column id to track MetricsMode to solve this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on pull request #6118: Parquet, Core: Fix collection of Parquet metrics when column names co…

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on PR #6118:
URL: https://github.com/apache/iceberg/pull/6118#issuecomment-1305081026

   @rdblue sure.
   When collecting metrics from Parquet footer, Iceberg [converts](https://github.com/apache/iceberg/blob/167a8ccd7c578296c40f8fc61c90135e71cf1183/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L107) the file MessageType to an Iceberg Schema and [uses](https://github.com/apache/iceberg/blob/167a8ccd7c578296c40f8fc61c90135e71cf1183/core/src/main/java/org/apache/iceberg/MetricsUtil.java#L56) this schema to get the column name of an field id it mapping, and then uses the obtained field name to get its corresponding metric mode. 
   
   However Iceberg will escape special characters in field names when converting an Iceberg Schema to an Parquet MessageType, and those escaped names cannot be restored when converting an Parquet MessageType back to an Iceberg Schema, that is to say, we are now using those escaped column names to get their corresponding metric modes, which may resulted in incorrect results since those escaped names cannot be recognized by MetricsConfig. 
   
   The ORC path does not have this problem because special characters are not escaped when converting to ORC schema and ORC [itself](https://lists.apache.org/thread/93xbnbs0mr0zxx4fzvrz10m5mmd4qb5w) can handle any UTF-8 characters in the column names.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6118: Parquet, Core: Fix collection of Parquet metrics when column names co…

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6118:
URL: https://github.com/apache/iceberg/pull/6118#issuecomment-1304892034

   @zhongyujiang can you explain the fix a bit more clearly?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on pull request #6118: Parquet, Core: Fix collection of Parquet metrics when column names co…

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on PR #6118:
URL: https://github.com/apache/iceberg/pull/6118#issuecomment-1305083278

   The failure seems unrelated to this PR:
   >* What went wrong:
   >Could not determine the dependencies of task ':iceberg-flink:iceberg-flink-runtime-1.16:shadowJar'.
   >See https://docs.gradle.org/7.5.1/userguide/command_line_interface.html#sec:command_line_warnings
   > Could not resolve all dependencies for configuration ':iceberg-flink:iceberg-flink-runtime-1.16:runtimeClasspath'.
      > Could not resolve org.apache.flink:flink-metrics-dropwizard:1.16.0.
        Required by:
            project :iceberg-flink:iceberg-flink-runtime-1.16
         > Could not resolve org.apache.flink:flink-metrics-dropwizard:1.16.0.
            > Could not get resource 'https://repo.maven.apache.org/maven2/org/apache/flink/flink-metrics-dropwizard/1.16.0/flink-metrics-dropwizard-1.16.0.pom'.
               > Could not GET 'https://repo.maven.apache.org/maven2/org/apache/flink/flink-metrics-dropwizard/1.16.0/flink-metrics-dropwizard-1.16.0.pom'.
                  > Connect to repo.maven.apache.org:4[43](https://github.com/apache/iceberg/actions/runs/3407136353/jobs/5666470632#step:4:44) [repo.maven.apache.org/146.75.28.215] failed: connect timed out 
   >


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zhongyujiang commented on a diff in pull request #6118: Parquet, Core: Fix collection of Parquet metrics when column names co…

Posted by GitBox <gi...@apache.org>.
zhongyujiang commented on code in PR #6118:
URL: https://github.com/apache/iceberg/pull/6118#discussion_r1015035815


##########
parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java:
##########
@@ -75,23 +76,27 @@ public static Metrics fileMetrics(InputFile file, MetricsConfig metricsConfig) {
   public static Metrics fileMetrics(
       InputFile file, MetricsConfig metricsConfig, NameMapping nameMapping) {
     try (ParquetFileReader reader = ParquetFileReader.open(ParquetIO.file(file))) {
-      return footerMetrics(reader.getFooter(), Stream.empty(), metricsConfig, nameMapping);
+      return footerMetrics(reader.getFooter(), Stream.empty(), metricsConfig, nameMapping, null);

Review Comment:
   This is used to migrate external files to Iceberg tables,  I am thinking if those files are generated by Iceberg, we may still not be able to collect metrics according to the correct metric mode with current change. 
   
   To solve this , I think we can change TableMigrationUtil to pass Iceberg schema also; 
   Or we can update MetricsConfig to make it tracks mapping between column ids metric modes, and use column ids to get metric modes directly when collecting parquet metrics, but that will need all writers to update from `MetricsConfig.fromProperties(Map<String, String> props)` to `MetricsConfig from(Map<String, String> props, Schema schema, SortOrder order)` (this is a private method now). 
   
   I prefer the second way and I can update the releated code in this repo, but I don't know if any external projects uses `MetricsConfig.fromProperties(Map<String, String> props)` too. What do you think about it? @rdblue 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org