You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/07/22 17:00:20 UTC

[GitHub] [iceberg] aokolnychyi opened a new pull request #2851: Core: Add delete table properties for Avro and Parquet

aokolnychyi opened a new pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851


    This PR adds table properties to configure delete Avro and Parquet writers independently of data writers.
   
   It is a subset of the changes in #2827.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi edited a comment on pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#issuecomment-885923681


   Thanks, everyone! I merged this to unblock subsequent PRs.
   Let me know if I missed any points. I'll address them in a separate PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675353080



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -142,6 +149,7 @@ private TableProperties() {
   public static final String WRITE_AUDIT_PUBLISH_ENABLED_DEFAULT = "false";
 
   public static final String WRITE_TARGET_FILE_SIZE_BYTES = "write.target-file-size-bytes";
+  public static final String DELETE_TARGET_FILE_SIZE_BYTES = "write.delete.target-file-size-bytes";

Review comment:
       So , are we planning to adjust the delete writer's target-file-size-bytes in the following PR ?  I see this property is referenced by nobody.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675726439



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -142,6 +149,7 @@ private TableProperties() {
   public static final String WRITE_AUDIT_PUBLISH_ENABLED_DEFAULT = "false";
 
   public static final String WRITE_TARGET_FILE_SIZE_BYTES = "write.target-file-size-bytes";
+  public static final String DELETE_TARGET_FILE_SIZE_BYTES = "write.delete.target-file-size-bytes";

Review comment:
       I have a number of follow-up PRs. Just want to keep the scope of this change small.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r674991118



##########
File path: core/src/main/java/org/apache/iceberg/TableProperties.java
##########
@@ -58,24 +58,31 @@ private TableProperties() {
   public static final boolean MANIFEST_MERGE_ENABLED_DEFAULT = true;
 
   public static final String DEFAULT_FILE_FORMAT = "write.format.default";
+  public static final String DELETE_DEFAULT_FILE_FORMAT = "write.delete.format.default";

Review comment:
       @jackye1995, I've removed the prefix like discussed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675726027



##########
File path: core/src/main/java/org/apache/iceberg/avro/Avro.java
##########
@@ -190,8 +189,48 @@ private CodecFactory codec() {
       // add the Iceberg schema to keyValueMetadata
       meta("iceberg.schema", SchemaParser.toJson(schema));
 
+      Context context = createContextFunc.apply(config);
+      CodecFactory codec = context.codec();
+
       return new AvroFileAppender<>(
-          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec(), metadata, metricsConfig, overwrite);
+          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec, metadata, metricsConfig, overwrite);
+    }
+
+    private static class Context {

Review comment:
       Sure, let me update that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675783289



##########
File path: core/src/main/java/org/apache/iceberg/avro/Avro.java
##########
@@ -190,8 +189,48 @@ private CodecFactory codec() {
       // add the Iceberg schema to keyValueMetadata
       meta("iceberg.schema", SchemaParser.toJson(schema));
 
+      Context context = createContextFunc.apply(config);
+      CodecFactory codec = context.codec();
+
       return new AvroFileAppender<>(
-          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec(), metadata, metricsConfig, overwrite);
+          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec, metadata, metricsConfig, overwrite);
+    }
+
+    private static class Context {

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#issuecomment-885923681


   Thanks, everyone! I merged this to unblock subsequent PRs.
   Let me know if I missed any points. I'll address them using a separate PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675726536



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -280,6 +283,90 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             metricsConfig);
       }
     }
+
+    private static class Context {

Review comment:
       Will update.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675783441



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -280,6 +283,90 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             metricsConfig);
       }
     }
+
+    private static class Context {

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#issuecomment-885068899


   cc @openinx @rdblue @RussellSpitzer @yyanyy @jackye1995


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675359681



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -280,6 +283,90 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             metricsConfig);
       }
     }
+
+    private static class Context {

Review comment:
       Nit:  I have the similar [comment](https://github.com/apache/iceberg/pull/2851/files#r675352093) for this class `Context`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675352093



##########
File path: core/src/main/java/org/apache/iceberg/avro/Avro.java
##########
@@ -190,8 +189,48 @@ private CodecFactory codec() {
       // add the Iceberg schema to keyValueMetadata
       meta("iceberg.schema", SchemaParser.toJson(schema));
 
+      Context context = createContextFunc.apply(config);
+      CodecFactory codec = context.codec();
+
       return new AvroFileAppender<>(
-          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec(), metadata, metricsConfig, overwrite);
+          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec, metadata, metricsConfig, overwrite);
+    }
+
+    private static class Context {

Review comment:
       Nit:  This class `Context` is private static class right, so I think all the following public methods can be marked as package-private, right ?

##########
File path: core/src/main/java/org/apache/iceberg/avro/Avro.java
##########
@@ -190,8 +189,48 @@ private CodecFactory codec() {
       // add the Iceberg schema to keyValueMetadata
       meta("iceberg.schema", SchemaParser.toJson(schema));
 
+      Context context = createContextFunc.apply(config);
+      CodecFactory codec = context.codec();
+
       return new AvroFileAppender<>(
-          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec(), metadata, metricsConfig, overwrite);
+          schema, AvroSchemaUtil.convert(schema, name), file, writerFunc, codec, metadata, metricsConfig, overwrite);
+    }
+
+    private static class Context {

Review comment:
       Nit:  This class `Context` is private static class, so I think all the following public methods can be marked as package-private, right ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] openinx commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
openinx commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675348352



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -280,6 +283,90 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             metricsConfig);
       }
     }
+
+    private static class Context {
+      private final int rowGroupSize;
+      private final int pageSize;
+      private final int dictionaryPageSize;
+      private final CompressionCodecName codec;
+      private final String compressionLevel;
+
+      private Context(int rowGroupSize, int pageSize, int dictionaryPageSize,
+                      CompressionCodecName codec, String compressionLevel) {
+        this.rowGroupSize = rowGroupSize;
+        this.pageSize = pageSize;
+        this.dictionaryPageSize = dictionaryPageSize;
+        this.codec = codec;
+        this.compressionLevel = compressionLevel;
+      }
+
+      public static Context dataContext(Map<String, String> config) {
+        int rowGroupSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT));
+
+        int pageSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_PAGE_SIZE_BYTES, PARQUET_PAGE_SIZE_BYTES_DEFAULT));
+
+        int dictionaryPageSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_DICT_SIZE_BYTES, PARQUET_DICT_SIZE_BYTES_DEFAULT));

Review comment:
       Nit:  could we align this value parser with the following `deleteContext` by using `PropertyUtil.propertyAsInt` ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi merged pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi merged pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675249717



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -262,7 +265,7 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             .build();
 
         return new org.apache.iceberg.parquet.ParquetWriter<>(

Review comment:
       Shouldn't we get writer classes now that take a "context" as a creation arg?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675725474



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -280,6 +283,90 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             metricsConfig);
       }
     }
+
+    private static class Context {
+      private final int rowGroupSize;
+      private final int pageSize;
+      private final int dictionaryPageSize;
+      private final CompressionCodecName codec;
+      private final String compressionLevel;
+
+      private Context(int rowGroupSize, int pageSize, int dictionaryPageSize,
+                      CompressionCodecName codec, String compressionLevel) {
+        this.rowGroupSize = rowGroupSize;
+        this.pageSize = pageSize;
+        this.dictionaryPageSize = dictionaryPageSize;
+        this.codec = codec;
+        this.compressionLevel = compressionLevel;
+      }
+
+      public static Context dataContext(Map<String, String> config) {
+        int rowGroupSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_ROW_GROUP_SIZE_BYTES, PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT));
+
+        int pageSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_PAGE_SIZE_BYTES, PARQUET_PAGE_SIZE_BYTES_DEFAULT));
+
+        int dictionaryPageSize = Integer.parseInt(config.getOrDefault(
+            PARQUET_DICT_SIZE_BYTES, PARQUET_DICT_SIZE_BYTES_DEFAULT));

Review comment:
       This is what I wanted to use too. Unfortunately, `PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT` is actually a String. That's why it is not possible to use `PropertyUtil`. Copied this code as it was in the original implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2851: Core: Add delete table properties for Avro and Parquet

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2851:
URL: https://github.com/apache/iceberg/pull/2851#discussion_r675300959



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java
##########
@@ -262,7 +265,7 @@ WriteBuilder withWriterVersion(WriterVersion version) {
             .build();
 
         return new org.apache.iceberg.parquet.ParquetWriter<>(

Review comment:
       You mean pass a context to `ParquetWriter`? I am not a huge fan of the introduced context and I consider it very internal to `WriteBuilder` used to create `FileAppender`. I think it is a good idea to avoid leaking it further. Also, such writers have some sort of contexts already. For example, `ParquetProperties` instantiated a few lines above covers some of the properties but not all.
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org