You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/04/28 12:24:30 UTC

[GitHub] [iceberg] chenjunjiedada opened a new pull request #2538: Core: Add delete marker metadata column

chenjunjiedada opened a new pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538


   This adds a metadata column to indicate whether a row is deleted or not.  A delete marker column can be used when finding the deleted rows as we discussed in https://github.com/apache/iceberg/pull/2372, it can also be used to simplify the overall merge on read process.  In order to avoid overhead, the delete marker metadata column only projected when the delete files exist.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637261502



##########
File path: data/src/test/java/org/apache/iceberg/data/DeleteReadTests.java
##########
@@ -329,7 +329,7 @@ private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
     return set;
   }
 
-  protected StructLikeSet rowSetWitIds(int... idsToRetain) {
+  protected StructLikeSet rowSetWithIds(int... idsToRetain) {

Review comment:
       Can you fix this in a separate PR? This file doesn't need to change and it could cause commit conflicts.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r639361132



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
##########
@@ -168,6 +172,27 @@ public void setPageSource(PageReadStore pageStore, long rowPosition) {
     }
   }
 
+  static class DeleteMarkerReader implements ParquetValueReader<Boolean> {

Review comment:
       Yes, we could use the existing constant reader.  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-848412266


   Thanks @rdblue for reviewing! I addressed your comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637261871



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ParquetValueReaders.java
##########
@@ -168,6 +172,27 @@ public void setPageSource(PageReadStore pageStore, long rowPosition) {
     }
   }
 
+  static class DeleteMarkerReader implements ParquetValueReader<Boolean> {

Review comment:
       Isn't there a constant reader that we can reuse?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-864322056


   Thanks for the reviewing and merging! @rdblue @jackye1995 @yyanyy !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652270736



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -748,4 +753,17 @@ public Long read(Decoder ignored, Object reuse) throws IOException {
       return currentPosition;
     }
   }
+
+  static class ConstantReader<C> implements ValueReader<C> {

Review comment:
       @chenjunjiedada, since this is no longer used, can you please remove it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-861728176


   Thanks, @chenjunjiedada. I'll take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-845518944


   overall looks good to me, I left some discussion comments in the original PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637264005



##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -191,9 +190,19 @@ public static void assertTableRows(Table table, List<RowData> expected) throws I
 
   public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
     table.refresh();
+

Review comment:
       I also find this suspicious. Is the extra column in the _expected_ records or the table? I don't think that this PR should change the data produced by `IcebergGenerics.read(table).build()`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-828427343


   To be clear, this PR is to enable reading of data while optionally marking rows as deleted rather than actually filtering them out so the delete information can be used in other utilities to rewrite delete files into more compact forms?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-864320036


   Thanks for pinging me, @chenjunjiedada! This looks good now, I'll merge it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652255491



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -609,6 +611,8 @@ protected StructReader(List<ValueReader<?>> readers, Types.StructType struct, Ma
         } else if (field.fieldId() == MetadataColumns.ROW_POSITION.fieldId()) {
           // track where the _pos field is located for setRowPositionSupplier
           this.posField = pos;
+        } else if (field.fieldId() == MetadataColumns.IS_DELETED.fieldId()) {
+          this.readers[pos] = new ConstantReader<>(false);

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-834036022


   cc @jackye1995 as well. 
   
   I added a couple of unit tests for this, more unit tests could be added in the deletes reader PR. @RussellSpitzer @yyanyy @rdblue @openinx, could you please help to take a look when you have time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637262048



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/vectorized/RowPositionColumnVector.java
##########
@@ -27,11 +27,11 @@
 import org.apache.spark.sql.vectorized.ColumnarMap;
 import org.apache.spark.unsafe.types.UTF8String;
 
-public class RowPostitionColumnVector extends ColumnVector {
+public class RowPositionColumnVector extends ColumnVector {
 
   private final long batchOffsetInFile;
 
-  RowPostitionColumnVector(long batchOffsetInFile) {
+  RowPositionColumnVector(long batchOffsetInFile) {

Review comment:
       Can we fix typos in a separate PR? This is already touching quite a few files and this file doesn't need to change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r623548330



##########
File path: data/src/main/java/org/apache/iceberg/data/DeleteFilter.java
##########
@@ -253,8 +255,8 @@ private static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
     // TODO: support adding nested columns. this will currently fail when finding nested columns to add
     List<Types.NestedField> columns = Lists.newArrayList(requestedSchema.columns());
     for (int fieldId : missingIds) {

Review comment:
       I think the logic here is if the `requestedSchema` doesn't contain the metadata column, we add them to the end. the `requiredIds` already has metadata columns.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r623514621



##########
File path: data/src/main/java/org/apache/iceberg/data/DeleteFilter.java
##########
@@ -253,8 +255,8 @@ private static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
     // TODO: support adding nested columns. this will currently fail when finding nested columns to add
     List<Types.NestedField> columns = Lists.newArrayList(requestedSchema.columns());
     for (int fieldId : missingIds) {

Review comment:
       should we add this new metadata column to `requiredIds` above? 

##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -191,9 +190,19 @@ public static void assertTableRows(Table table, List<RowData> expected) throws I
 
   public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
     table.refresh();
+

Review comment:
       Why do we need this change? is it because two sets don't match after we add the additional metadata column? but wouldn't `_pos` and `_file` also have this issue?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-863628683


   @rdblue , The last comment was addressed, could we merge this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r639360875



##########
File path: core/src/main/java/org/apache/iceberg/MetadataColumns.java
##########
@@ -36,6 +36,8 @@ private MetadataColumns() {
       Integer.MAX_VALUE - 1, "_file", Types.StringType.get(), "Path of the file in which a row is stored");
   public static final NestedField ROW_POSITION = NestedField.required(
       Integer.MAX_VALUE - 2, "_pos", Types.LongType.get(), "Ordinal position of a row in the source data file");
+  public static final NestedField DELETE_MARK = NestedField.required(

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-828412403


   This is a separate part from https://github.com/apache/iceberg/pull/2372. cc @openinx @rdblue @aokolnychyi @RussellSpitzer @yyanyy 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-828449508


   @RussellSpitzer you are right! With a delete marker column, we can also simplify the current filtering logic a bit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652129230



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -609,6 +611,8 @@ protected StructReader(List<ValueReader<?>> readers, Types.StructType struct, Ma
         } else if (field.fieldId() == MetadataColumns.ROW_POSITION.fieldId()) {
           // track where the _pos field is located for setRowPositionSupplier
           this.posField = pos;
+        } else if (field.fieldId() == MetadataColumns.IS_DELETED.fieldId()) {
+          this.readers[pos] = new ConstantReader<>(false);

Review comment:
       Constants are already handled by this class, see the first branch of this if/else logic. I think that it would make more sense to reuse that rather than create a new constant reader.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652274973



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -591,6 +591,8 @@ protected StructReader(List<ValueReader<?>> readers, Schema schema) {
         if (AvroSchemaUtil.getFieldId(field) == MetadataColumns.ROW_POSITION.fieldId()) {
           // track where the _pos field is located for setRowPositionSupplier
           this.posField = pos;
+        } else if (AvroSchemaUtil.getFieldId(field) == MetadataColumns.IS_DELETED.fieldId()) {
+          this.readers[pos] = new ConstantReader<>(false);

Review comment:
       Oops, my brain was not working well this early morning... Let me take a coffee first.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-861814718


   @chenjunjiedada, the constant handling in Avro appears to be the only remaining issue. Once you update that, I'll merge this. Thanks for working on it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-861101630


   @rdblue Is this ready to merge? The delete rewrites may depend on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652270970



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -591,6 +591,8 @@ protected StructReader(List<ValueReader<?>> readers, Schema schema) {
         if (AvroSchemaUtil.getFieldId(field) == MetadataColumns.ROW_POSITION.fieldId()) {
           // track where the _pos field is located for setRowPositionSupplier
           this.posField = pos;
+        } else if (AvroSchemaUtil.getFieldId(field) == MetadataColumns.IS_DELETED.fieldId()) {
+          this.readers[pos] = new ConstantReader<>(false);

Review comment:
       Can you convert this to use positions and constants as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652292541



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -748,4 +753,17 @@ public Long read(Decoder ignored, Object reuse) throws IOException {
       return currentPosition;
     }
   }
+
+  static class ConstantReader<C> implements ValueReader<C> {

Review comment:
       Sure,I should remove this first to track all its usage. I believe now no one is using it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r652270736



##########
File path: core/src/main/java/org/apache/iceberg/avro/ValueReaders.java
##########
@@ -748,4 +753,17 @@ public Long read(Decoder ignored, Object reuse) throws IOException {
       return currentPosition;
     }
   }
+
+  static class ConstantReader<C> implements ValueReader<C> {

Review comment:
       @chenjunjiedada, since this should no longer be needed, can you please remove it?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637339281



##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -191,9 +190,19 @@ public static void assertTableRows(Table table, List<RowData> expected) throws I
 
   public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
     table.refresh();
+

Review comment:
       It is in _expected records_. The extra column is added in `DeleteFileter#fileProjection`.
   
   ```
   private static Schema fileProjection(Schema tableSchema, Schema requestedSchema,
                                          List<DeleteFile> posDeletes, List<DeleteFile> eqDeletes) {
       ...
       // We add it to requiredIds, so that it exists in missingIds when requestedSchema doesn't contain it.
       if (!posDeletes.isEmpty()) {
         requiredIds.add(MetadataColumns.ROW_POSITION.fieldId());
       }
       ....
   
   
       // We append it at the end anyway.
       if (missingIds.contains(MetadataColumns.ROW_POSITION.fieldId())) {
         columns.add(MetadataColumns.ROW_POSITION);
       }
   
       return new Schema(columns);
     }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r623547421



##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -191,9 +190,19 @@ public static void assertTableRows(Table table, List<RowData> expected) throws I
 
   public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
     table.refresh();
+

Review comment:
       Because the metadata column projection logic produces additional columns even when the `requestedSchema` doesn't contain them and that is why we use `StructLikeSet`.  The `_pos` column shows up only when positional deletes exist,  the `_deleted` marker shows up when any of the deletes exist.
   
   The failed unit test contains only equality delete which produces only `_deleted` column, so it failed with `HashMultiSet` comparison. But when the unit test, for example `TestIcebergFilesCommitter.TestCommitTwoCheckpointsInSingleTxn`, contains a positional delete, the unit test fails as well due to it has the additional column `_pos`. The following patch for the unit test could test it.
   ```
   -      DeleteFile deleteFile1 = writeEqDeleteFile(appenderFactory, "delete-file-1", ImmutableList.of(delete3));
   +      DeleteFile deleteFile1 = writePosDeleteFile(appenderFactory,
   +          "pos-delete-file-1",
   +          ImmutableList.of(Pair.of(dataFile1.path(), 3L)));
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r623547421



##########
File path: flink/src/test/java/org/apache/iceberg/flink/SimpleDataUtil.java
##########
@@ -191,9 +190,19 @@ public static void assertTableRows(Table table, List<RowData> expected) throws I
 
   public static void assertTableRecords(Table table, List<Record> expected) throws IOException {
     table.refresh();
+

Review comment:
       Because the metadata column projection logic produces additional columns even when the `requiredSchema` doesn't contain them and that is why we use `StructLikeSet`.  The `_pos` column shows up only when positional deletes exist,  the `_deleted` marker shows up when any of the deletes exist.
   
   The failed unit test contains only equality delete which produces only `_deleted` column, so it failed with `HashMultiSet` comparison. But when the unit test, for example `TestIcebergFilesCommitter.TestCommitTwoCheckpointsInSingleTxn`, contains a positional delete, the unit test fails as well due to it has the additional column `_pos`. The following patch for the unit test could test it.
   ```
   -      DeleteFile deleteFile1 = writeEqDeleteFile(appenderFactory, "delete-file-1", ImmutableList.of(delete3));
   +      DeleteFile deleteFile1 = writePosDeleteFile(appenderFactory,
   +          "pos-delete-file-1",
   +          ImmutableList.of(Pair.of(dataFile1.path(), 3L)));
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r639361455



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -233,4 +236,16 @@ public void setBatchContext(long newBatchOffsetInFile) {
       this.batchOffsetInFile = newBatchOffsetInFile;
     }
   }
+
+  private static class DeleteMarkReader implements OrcValueReader<Boolean> {

Review comment:
       We could use the existing constant reader from parquet and orc readers. I also created a new constant reader for Avro reader.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#issuecomment-845924223


   Thanks @jackye1995 !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637262353



##########
File path: orc/src/main/java/org/apache/iceberg/orc/OrcValueReaders.java
##########
@@ -233,4 +236,16 @@ public void setBatchContext(long newBatchOffsetInFile) {
       this.batchOffsetInFile = newBatchOffsetInFile;
     }
   }
+
+  private static class DeleteMarkReader implements OrcValueReader<Boolean> {

Review comment:
       Is there a constant reader that could be reused?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #2538: Core: Add delete marker metadata column

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2538:
URL: https://github.com/apache/iceberg/pull/2538#discussion_r637261072



##########
File path: core/src/main/java/org/apache/iceberg/MetadataColumns.java
##########
@@ -36,6 +36,8 @@ private MetadataColumns() {
       Integer.MAX_VALUE - 1, "_file", Types.StringType.get(), "Path of the file in which a row is stored");
   public static final NestedField ROW_POSITION = NestedField.required(
       Integer.MAX_VALUE - 2, "_pos", Types.LongType.get(), "Ordinal position of a row in the source data file");
+  public static final NestedField DELETE_MARK = NestedField.required(

Review comment:
       Instead of `DELETE_MARK`, how about `IS_DELETED`? I don't think that "mark" is clear enough to describe what this is. Similarly, I think the docs should be "Whether the row has been deleted". There's no need to include "delete mark" because that's identifying something that is not defined (this column is _deleted and "mark" is not introduced), and "or not" is unnecessary because it is implied by "whether".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org