You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/09/23 13:40:02 UTC

[GitHub] [iceberg] chenjunjiedada opened a new pull request #1497: MR: apply row-level delete files when reading

chenjunjiedada opened a new pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497


   This applies row-level delete files when reading for IcebergInputFormat. This also includes changes from https://github.com/apache/iceberg/pull/985.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496275347



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.After;
+import org.junit.Before;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+
+    return TestTables.create(tableDir, name, schema, spec, 2);
+  }
+
+  @Override
+  public StructLikeSet rowSet(Table table, String... columns) throws IOException {
+    StructLikeSet set = StructLikeSet.create(table.schema().asStruct());
+    try (CloseableIterable<Record> reader = IcebergGenerics.read(table).select(columns).build()) {
+      reader.forEach(set::add);
+    }
+    return set;
+  }
+
+  @Before
+  public void writeTestDataFile() throws IOException {
+    this.table = createTable("test", SCHEMA, SPEC);
+    generateTestData();
+    table.newAppend()
+        .appendFile(dataFile)
+        .commit();

Review comment:
       Why is this `@Before` here and not in the base class?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494016238



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );

Review comment:
       Updated.

##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -92,6 +72,22 @@ public void testEqualityDeletes() throws IOException {
     Assert.assertEquals("Table should contain expected rows", expected, actual);
   }
 
+  protected void generateTestData() throws IOException {
+    this.records = new ArrayList<>();

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496456617



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,49 +25,45 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
 import org.junit.Assert;
-import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected Table table;
+  protected DataFile dataFile;
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+  private List<Record> records;
 
-    table.newAppend()
-        .appendFile(dataFile)
-        .commit();
-  }
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public abstract Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException;

Review comment:
       Updated to protected.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496279063



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       In that case, this can pass the `FileIO` somehow, or we can work on getting the other PR done before this one. But I don't think we should mix the two features together.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496279294



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  private TestHelper helper;
+  private InputFormatConfig.ConfigBuilder builder;
+  private Configuration conf;
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    return new Object[][] {
+        new Object[] { "IcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "IcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "IcebergInputFormat", FileFormat.ORC },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.ORC },
+    };
+  }
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    Table table;
+    conf = new Configuration();
+    HadoopTables tables = new HadoopTables(conf);
+    File location = temp.newFolder(testInputFormat.name(), fileFormat.name());
+    Assert.assertTrue(location.delete());
+    helper = new TestHelper(conf, tables, location.toString(), schema, spec, fileFormat, temp);
+    table = helper.createTable();
+
+    TableOperations ops = ((BaseTable) table).operations();
+    TableMetadata meta = ops.current();
+    ops.commit(meta, meta.upgradeToFormatVersion(2));
+
+    return table;
+  }
+
+  @Override
+  public StructLikeSet rowSet(Table table, String... columns) {
+    Schema projected = table.schema().select(columns);
+    StructLikeSet set = StructLikeSet.create(projected.asStruct());
+    set.addAll(testInputFormat.create(builder.project(projected).conf()).getRecords());
+
+    return set;
+  }
+
+  public TestMrReadDeletes(String inputFormat, FileFormat fileFormat) {

Review comment:
       This constructor should be just below the parameters, and then abstract method implementations should come just afterwards.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496281843



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       For now, this could create a new `HadoopFileIO` and use that instead. That would be the easiest path forward.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496305217



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       @holdenk just added a `FileIO` instance to this class, so you can use that instead of mixing the two PRs together. Thanks @holdenk!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493950613



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );

Review comment:
       Why not put the schema and spec in the parent class, `DeletesReadTest`? The data it generates is for this schema.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496882493



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,53 +25,68 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected final String testTableName = "test";
+  protected Table testTable;
+  protected DataFile dataFile;
+
+  private List<Record> records;
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
 
-    table.newAppend()
+  @Before
+  public void prepareData() throws IOException {
+    this.testTable = createTable(testTableName, SCHEMA, SPEC);
+    generateTestData();
+    testTable.newAppend()
         .appendFile(dataFile)
         .commit();
   }
 
+  @After
+  public void cleanup() throws IOException {
+    dropTable(testTableName);
+  }
+
+  protected abstract Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException;
+
+  protected abstract void dropTable(String name) throws IOException;
+
   @Test
   public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
+    Schema deleteRowSchema = testTable.schema().select("data");

Review comment:
       Renaming `table` -> `testTable` introduces unnecessary changes and doesn't add much value. What was the rationale for doing this?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493951438



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -92,6 +72,22 @@ public void testEqualityDeletes() throws IOException {
     Assert.assertEquals("Table should contain expected rows", expected, actual);
   }
 
+  protected void generateTestData() throws IOException {
+    this.records = new ArrayList<>();

Review comment:
       We prefer using `Lists.newArrayList()`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494016568



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Locale;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    Object[][] parameters = new Object[TestIcebergInputFormats.TESTED_INPUT_FORMATS.size() *
+        TestIcebergInputFormats.TESTED_FILE_FORMATS.size()][2];
+
+    int idx = 0;
+
+    for (TestIcebergInputFormats.TestInputFormat.Factory<Record> inputFormat :
+        TestIcebergInputFormats.TESTED_INPUT_FORMATS) {
+      for (String fileFormat : TestIcebergInputFormats.TESTED_FILE_FORMATS) {
+        parameters[idx++] = new Object[] {inputFormat, fileFormat};
+      }
+    }
+
+    return parameters;

Review comment:
       Yes, I just updated it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493952200



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      CloseableIterable<T> iter;
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Avro support not yet supported for Pig and Hive");
+        case GENERIC:
+          DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
+          Schema requiredSchema = deletes.requiredSchema();
+          iter = deletes.filter(openTask(currentTask, requiredSchema));

Review comment:
       Why not return `deletes.filter(...)` here? That would remove the need for `iter` and `break`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496887185



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.After;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {

Review comment:
       This shouldn't be `public` when the abstract method is `protected`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496278618



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")

Review comment:
       Okay, makes sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497141257



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java
##########
@@ -370,7 +370,7 @@ public void testCustomCatalog() throws IOException {
     testInputFormat.create(builder.conf()).validate(expectedRecords);
   }
 
-  private abstract static class TestInputFormat<T> {
+  public abstract static class TestInputFormat<T> {

Review comment:
       Hmm, I thought accessing `TestIcebergInputFormats.TestInputFormat.newFactory` would need it to be public. It turns out not necessary. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497139789



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.After;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {

Review comment:
       Fixed.

##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -250,6 +252,22 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Pig and Hive are not supported for any format");

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497139624



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,53 +25,68 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected final String testTableName = "test";
+  protected Table testTable;
+  protected DataFile dataFile;
+
+  private List<Record> records;
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
 
-    table.newAppend()
+  @Before
+  public void prepareData() throws IOException {
+    this.testTable = createTable(testTableName, SCHEMA, SPEC);
+    generateTestData();
+    testTable.newAppend()
         .appendFile(dataFile)
         .commit();
   }
 
+  @After
+  public void cleanup() throws IOException {
+    dropTable(testTableName);
+  }
+
+  protected abstract Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException;
+
+  protected abstract void dropTable(String name) throws IOException;
+
   @Test
   public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
+    Schema deleteRowSchema = testTable.schema().select("data");

Review comment:
       The `checkstyle` complaints hidden field since the `rowSet` use the parameter name `table`. I just reverted the changes back to change `rowSet` parameter name.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r498379248



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -63,12 +82,20 @@ public void writeTestDataFile() throws IOException {
     records.add(record.copy("id", 122, "data", "g"));
 
     this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
-

Review comment:
       Nit: whitespace-only change.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494018345



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -269,4 +265,5 @@ private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
         .forEach(set::add);
     return set;
   }
+

Review comment:
       Removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496275564



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java
##########
@@ -20,285 +20,111 @@
 package org.apache.iceberg.spark.source;
 
 import java.io.IOException;
-import java.util.List;
-import java.util.Set;
+import org.apache.hadoop.hive.conf.HiveConf;
 import org.apache.iceberg.BaseTable;
-import org.apache.iceberg.DataFile;
-import org.apache.iceberg.DeleteFile;
-import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
 import org.apache.iceberg.TableMetadata;
 import org.apache.iceberg.TableOperations;
-import org.apache.iceberg.TestHelpers.Row;
+import org.apache.iceberg.catalog.Namespace;
 import org.apache.iceberg.catalog.TableIdentifier;
-import org.apache.iceberg.data.FileHelpers;
-import org.apache.iceberg.data.GenericRecord;
-import org.apache.iceberg.data.Record;
-import org.apache.iceberg.relocated.com.google.common.collect.Lists;
-import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.hive.HiveCatalog;
+import org.apache.iceberg.hive.TestHiveMetastore;
 import org.apache.iceberg.spark.SparkStructLike;
-import org.apache.iceberg.spark.SparkTestBase;
 import org.apache.iceberg.types.Types;
-import org.apache.iceberg.util.ArrayUtil;
-import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
-import org.apache.iceberg.util.StructProjection;
 import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.internal.SQLConf;
 import org.junit.After;
-import org.junit.Assert;
+import org.junit.AfterClass;
 import org.junit.Before;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
+import org.junit.BeforeClass;
 
-public abstract class TestSparkReaderDeletes extends SparkTestBase {
-  private static final Schema SCHEMA = new Schema(
-      Types.NestedField.required(1, "id", Types.IntegerType.get()),
-      Types.NestedField.required(2, "data", Types.StringType.get()));
-  private Table table = null;
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;
 
-  @Rule
-  public TemporaryFolder temp = new TemporaryFolder();
+public abstract class TestSparkReaderDeletes extends DeletesReadTest {
 
-  @Before
-  public void createTable() throws IOException {
-    this.table = catalog.createTable(TableIdentifier.of("default", "table"), SCHEMA);
-    TableOperations ops = ((BaseTable) table).operations();
-    TableMetadata meta = ops.current();
-    ops.commit(meta, meta.upgradeToFormatVersion(2));
-
-    this.records = Lists.newArrayList();
+  private static TestHiveMetastore metastore = null;
+  protected static SparkSession spark = null;
+  protected static HiveCatalog catalog = null;
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  @BeforeClass
+  public static void startMetastoreAndSpark() {
+    metastore = new TestHiveMetastore();
+    metastore.start();
+    HiveConf hiveConf = metastore.hiveConf();
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+    spark = SparkSession.builder()
+        .master("local[2]")
+        .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
+        .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
+        .enableHiveSupport()
+        .getOrCreate();
 
-    table.newAppend()
-        .appendFile(dataFile)
-        .commit();
-  }
+    catalog = new HiveCatalog(spark.sessionState().newHadoopConf());
 
-  @After
-  public void dropTable() {
-    catalog.dropTable(TableIdentifier.of("default", "table"));
+    try {
+      catalog.createNamespace(Namespace.of("default"));
+    } catch (AlreadyExistsException ignored) {
+      // the default namespace already exists. ignore the create error
+    }
   }
 
-  @Test
-  public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
+  @AfterClass
+  public static void stopMetastoreAndSpark() {
+    catalog.close();
+    catalog = null;
+    metastore.stop();
+    metastore = null;
+    spark.stop();
+    spark = null;
   }
 
-  @Test
-  public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
-    StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testPositionDeletes() throws IOException {
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 0L), // id = 29
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 6L) // id = 122
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMixedPositionAndEqualityDeletes() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 5L) // id = 121
-    );
+  @Before
+  public void prepareData() throws IOException {
+    this.table = createTable("table", SCHEMA, SPEC);

Review comment:
       Isn't this identical to the one in the generics test? I think this should be in the base class instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494016363



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  @Before
+  public void writeTestDataFile() throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+    this.table = TestTables.create(tableDir, "test", SCHEMA, SPEC, 2);

Review comment:
       Make sense to me. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493987692



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       The `GenericDeleteFilter` constructor needs `FileIO` as parameter.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496456968



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,22 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Avro support not yet supported for Pig and Hive");

Review comment:
       This is a copy-paste issue, just updated.

##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  private TestHelper helper;
+  private InputFormatConfig.ConfigBuilder builder;
+  private Configuration conf;
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    return new Object[][] {
+        new Object[] { "IcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "IcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "IcebergInputFormat", FileFormat.ORC },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.ORC },
+    };
+  }
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    Table table;
+    conf = new Configuration();
+    HadoopTables tables = new HadoopTables(conf);
+    File location = temp.newFolder(testInputFormat.name(), fileFormat.name());
+    Assert.assertTrue(location.delete());
+    helper = new TestHelper(conf, tables, location.toString(), schema, spec, fileFormat, temp);
+    table = helper.createTable();
+
+    TableOperations ops = ((BaseTable) table).operations();
+    TableMetadata meta = ops.current();
+    ops.commit(meta, meta.upgradeToFormatVersion(2));
+
+    return table;
+  }
+
+  @Override
+  public StructLikeSet rowSet(Table table, String... columns) {
+    Schema projected = table.schema().select(columns);
+    StructLikeSet set = StructLikeSet.create(projected.asStruct());
+    set.addAll(testInputFormat.create(builder.project(projected).conf()).getRecords());
+
+    return set;
+  }
+
+  public TestMrReadDeletes(String inputFormat, FileFormat fileFormat) {

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494015472



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")

Review comment:
       `deletes.filter(...)` needs this.

##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java
##########
@@ -93,212 +102,4 @@ public void createTable() throws IOException {
   public void dropTable() {
     catalog.dropTable(TableIdentifier.of("default", "table"));
   }
-
-  @Test
-  public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
-    StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testPositionDeletes() throws IOException {
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 0L), // id = 29
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 6L) // id = 122
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMixedPositionAndEqualityDeletes() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 5L) // id = 121
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMultipleEqualityDeleteSchemas() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile dataEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    Schema idSchema = table.schema().select("id");
-    Record idDelete = GenericRecord.create(idSchema);
-    List<Record> idDeletes = Lists.newArrayList(
-        idDelete.copy("id", 121), // id = 121
-        idDelete.copy("id", 29) // id = 29
-    );
-
-    DeleteFile idEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), idDeletes, idSchema);
-
-    table.newRowDelta()
-        .addDeletes(dataEqDeletes)
-        .addDeletes(idEqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeleteByNull() throws IOException {
-    // data is required in the test table; make it optional for this test
-    table.updateSchema()
-        .makeColumnOptional("data")
-        .commit();
-
-    // add a new data file with a record where data is null
-    Record record = GenericRecord.create(table.schema());
-    DataFile dataFileWithNull = FileHelpers.writeDataFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0),
-        Lists.newArrayList(record.copy("id", 131, "data", null)));
-
-    table.newAppend()
-        .appendFile(dataFileWithNull)
-        .commit();
-
-    // delete where data is null
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", null) // id = 131
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(131);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  private static StructLikeSet rowSet(Table table) {
-    return rowSet(table, "*");
-  }
-
-  private static StructLikeSet rowSet(Table table, String... columns) {

Review comment:
       Sorry, I missed this. I just added this back and also use input format to read records.

##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -148,15 +144,15 @@ public void testMixedPositionAndEqualityDeletes() throws IOException {
     );
 
     DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
+        table, Files.localOutput(temp.newFile()), TestHelpers.Row.of(0), dataDeletes, dataSchema);

Review comment:
       Done.

##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );

Review comment:
       Updated.

##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -92,6 +72,22 @@ public void testEqualityDeletes() throws IOException {
     Assert.assertEquals("Table should contain expected rows", expected, actual);
   }
 
+  protected void generateTestData() throws IOException {
+    this.records = new ArrayList<>();

Review comment:
       Done.

##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  @Before
+  public void writeTestDataFile() throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+    this.table = TestTables.create(tableDir, "test", SCHEMA, SPEC, 2);

Review comment:
       Make sense to me. Updated.

##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      CloseableIterable<T> iter;
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Avro support not yet supported for Pig and Hive");
+        case GENERIC:
+          DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
+          Schema requiredSchema = deletes.requiredSchema();
+          iter = deletes.filter(openTask(currentTask, requiredSchema));

Review comment:
       Updated.

##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Locale;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    Object[][] parameters = new Object[TestIcebergInputFormats.TESTED_INPUT_FORMATS.size() *
+        TestIcebergInputFormats.TESTED_FILE_FORMATS.size()][2];
+
+    int idx = 0;
+
+    for (TestIcebergInputFormats.TestInputFormat.Factory<Record> inputFormat :
+        TestIcebergInputFormats.TESTED_INPUT_FORMATS) {
+      for (String fileFormat : TestIcebergInputFormats.TESTED_FILE_FORMATS) {
+        parameters[idx++] = new Object[] {inputFormat, fileFormat};
+      }
+    }
+
+    return parameters;

Review comment:
       Yes, I just updated it.

##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -269,4 +265,5 @@ private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
         .forEach(set::add);
     return set;
   }
+

Review comment:
       Removed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493951982



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")

Review comment:
       Why is this needed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497137754



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,53 +25,68 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected final String testTableName = "test";
+  protected Table testTable;
+  protected DataFile dataFile;

Review comment:
       It should be private now. Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496882219



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -80,21 +95,37 @@ public void testEqualityDeletes() throws IOException {
     );
 
     DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
+        testTable, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
 
-    table.newRowDelta()
+    testTable.newRowDelta()
         .addDeletes(eqDeletes)
         .commit();
 
     StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
+    StructLikeSet actual = rowSet(testTable);
 
     Assert.assertEquals("Table should contain expected rows", expected, actual);
   }
 
+  protected void generateTestData() throws IOException {

Review comment:
       I think this method should no longer exist, and there is no reason for it to be accessible to subclasses. It is also not a good idea to write methods like this one that have assumptions about other variables and operate by creating side-effects. It is less of a problem now that it is not called by subclasses. Now, there is just little reason for it to exist when this could be moved into the `@Before` method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496456784



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.io.CloseableIterable;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.After;
+import org.junit.Before;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+
+    return TestTables.create(tableDir, name, schema, spec, 2);
+  }
+
+  @Override
+  public StructLikeSet rowSet(Table table, String... columns) throws IOException {
+    StructLikeSet set = StructLikeSet.create(table.schema().asStruct());
+    try (CloseableIterable<Record> reader = IcebergGenerics.read(table).select(columns).build()) {
+      reader.forEach(set::add);
+    }
+    return set;
+  }
+
+  @Before
+  public void writeTestDataFile() throws IOException {
+    this.table = createTable("test", SCHEMA, SPEC);
+    generateTestData();
+    table.newAppend()
+        .appendFile(dataFile)
+        .commit();

Review comment:
       Updated to the base class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493952629



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.Locale;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    Object[][] parameters = new Object[TestIcebergInputFormats.TESTED_INPUT_FORMATS.size() *
+        TestIcebergInputFormats.TESTED_FILE_FORMATS.size()][2];
+
+    int idx = 0;
+
+    for (TestIcebergInputFormats.TestInputFormat.Factory<Record> inputFormat :
+        TestIcebergInputFormats.TESTED_INPUT_FORMATS) {
+      for (String fileFormat : TestIcebergInputFormats.TESTED_FILE_FORMATS) {
+        parameters[idx++] = new Object[] {inputFormat, fileFormat};
+      }
+    }
+
+    return parameters;

Review comment:
       Is there a simpler way to configure this? Normally, we build these using literals instead of a block of code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497303303



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java
##########
@@ -370,7 +370,7 @@ public void testCustomCatalog() throws IOException {
     testInputFormat.create(builder.conf()).validate(expectedRecords);
   }
 
-  private abstract static class TestInputFormat<T> {
+  public abstract static class TestInputFormat<T> {

Review comment:
       I tried to minimize the changes to access modifiers in `TestIcebergInputFormat`, the last place that needs the modifier of `TestInputFormat` to be public is `getRecords` method.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494016471



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      CloseableIterable<T> iter;
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Avro support not yet supported for Pig and Hive");
+        case GENERIC:
+          DeleteFilter deletes = new GenericDeleteFilter(io, currentTask, tableSchema, readSchema);
+          Schema requiredSchema = deletes.requiredSchema();
+          iter = deletes.filter(openTask(currentTask, requiredSchema));

Review comment:
       Updated.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r498379864



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,31 +25,50 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  protected final String testTableName = "test";
+  protected Table table;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+  private DataFile dataFile;
+  private List<Record> records;

Review comment:
       Changing the order of these two lines and dropping the default also causes unnecessary changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496887810



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -250,6 +252,22 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Pig and Hive are not supported for any format");

Review comment:
       This should be clear that Pig and Hive object models are not supported. Pig and Hive engines can read using generics.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#issuecomment-702883822


   @chenjunjiedada, I went ahead and fixed the remaining issues and opened a PR against your branch. Could you please take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494015968



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java
##########
@@ -93,212 +102,4 @@ public void createTable() throws IOException {
   public void dropTable() {
     catalog.dropTable(TableIdentifier.of("default", "table"));
   }
-
-  @Test
-  public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
-    StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testPositionDeletes() throws IOException {
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 0L), // id = 29
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 6L) // id = 122
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMixedPositionAndEqualityDeletes() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 5L) // id = 121
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMultipleEqualityDeleteSchemas() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile dataEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    Schema idSchema = table.schema().select("id");
-    Record idDelete = GenericRecord.create(idSchema);
-    List<Record> idDeletes = Lists.newArrayList(
-        idDelete.copy("id", 121), // id = 121
-        idDelete.copy("id", 29) // id = 29
-    );
-
-    DeleteFile idEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), idDeletes, idSchema);
-
-    table.newRowDelta()
-        .addDeletes(dataEqDeletes)
-        .addDeletes(idEqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeleteByNull() throws IOException {
-    // data is required in the test table; make it optional for this test
-    table.updateSchema()
-        .makeColumnOptional("data")
-        .commit();
-
-    // add a new data file with a record where data is null
-    Record record = GenericRecord.create(table.schema());
-    DataFile dataFileWithNull = FileHelpers.writeDataFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0),
-        Lists.newArrayList(record.copy("id", 131, "data", null)));
-
-    table.newAppend()
-        .appendFile(dataFileWithNull)
-        .commit();
-
-    // delete where data is null
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", null) // id = 131
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(131);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  private static StructLikeSet rowSet(Table table) {
-    return rowSet(table, "*");
-  }
-
-  private static StructLikeSet rowSet(Table table, String... columns) {

Review comment:
       Sorry, I missed this. I just added this back and also use input format to read records.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496880204



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,53 +25,68 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected final String testTableName = "test";
+  protected Table testTable;
+  protected DataFile dataFile;

Review comment:
       Why are these protected and not private?
   
   My earlier suggestion was to avoid sharing fields with subclasses because the tests are all in this class. Only creating a table and reading it needs to be customized by subclasses, and we can pass in anything necessary to do that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r498379505



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,31 +25,50 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  protected final String testTableName = "test";
+  protected Table table;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+  private DataFile dataFile;
+  private List<Record> records;
+
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
 
   @Before
-  public void writeTestDataFile() throws IOException {
+  public void prepareData() throws IOException {

Review comment:
       Nit: unnecessary method renames cause more changed lines than needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493949680



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       While I would like to get the encryption manager and io changes in, I don't think that they should be mixed into this commit. Was it necessary to do this for some reason?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496279725



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  private TestHelper helper;
+  private InputFormatConfig.ConfigBuilder builder;
+  private Configuration conf;
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    return new Object[][] {
+        new Object[] { "IcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "IcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "IcebergInputFormat", FileFormat.ORC },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.ORC },
+    };
+  }
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    Table table;
+    conf = new Configuration();

Review comment:
       Why does this create a new configuration?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496274847



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,49 +25,45 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
 import org.junit.Assert;
-import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
 
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
 
-  @Before
-  public void writeTestDataFile() throws IOException {
-    this.records = Lists.newArrayList();
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  protected Table table;
+  protected DataFile dataFile;
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+  private List<Record> records;
 
-    table.newAppend()
-        .appendFile(dataFile)
-        .commit();
-  }
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  public abstract Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException;

Review comment:
       Why is this `public` and not `protected` to expose it to child classes?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494015472



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,26 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")

Review comment:
       `deletes.filter(...)` needs this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493948979



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -269,4 +265,5 @@ private StructLikeSet rowSetWithoutIds(int... idsToRemove) {
         .forEach(set::add);
     return set;
   }
+

Review comment:
       Nit: unnecessary whitespace change




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493948922



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -148,15 +144,15 @@ public void testMixedPositionAndEqualityDeletes() throws IOException {
     );
 
     DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
+        table, Files.localOutput(temp.newFile()), TestHelpers.Row.of(0), dataDeletes, dataSchema);

Review comment:
       Can you import this class directly to avoid so many changes in this file?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496455338



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java
##########
@@ -20,285 +20,111 @@
 package org.apache.iceberg.spark.source;
 
 import java.io.IOException;
-import java.util.List;
-import java.util.Set;
+import org.apache.hadoop.hive.conf.HiveConf;
 import org.apache.iceberg.BaseTable;
-import org.apache.iceberg.DataFile;
-import org.apache.iceberg.DeleteFile;
-import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
 import org.apache.iceberg.TableMetadata;
 import org.apache.iceberg.TableOperations;
-import org.apache.iceberg.TestHelpers.Row;
+import org.apache.iceberg.catalog.Namespace;
 import org.apache.iceberg.catalog.TableIdentifier;
-import org.apache.iceberg.data.FileHelpers;
-import org.apache.iceberg.data.GenericRecord;
-import org.apache.iceberg.data.Record;
-import org.apache.iceberg.relocated.com.google.common.collect.Lists;
-import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.hive.HiveCatalog;
+import org.apache.iceberg.hive.TestHiveMetastore;
 import org.apache.iceberg.spark.SparkStructLike;
-import org.apache.iceberg.spark.SparkTestBase;
 import org.apache.iceberg.types.Types;
-import org.apache.iceberg.util.ArrayUtil;
-import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
-import org.apache.iceberg.util.StructProjection;
 import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.internal.SQLConf;
 import org.junit.After;
-import org.junit.Assert;
+import org.junit.AfterClass;
 import org.junit.Before;
-import org.junit.Rule;
-import org.junit.Test;
-import org.junit.rules.TemporaryFolder;
+import org.junit.BeforeClass;
 
-public abstract class TestSparkReaderDeletes extends SparkTestBase {
-  private static final Schema SCHEMA = new Schema(
-      Types.NestedField.required(1, "id", Types.IntegerType.get()),
-      Types.NestedField.required(2, "data", Types.StringType.get()));
-  private Table table = null;
-  private List<Record> records = null;
-  private DataFile dataFile = null;
+import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;
 
-  @Rule
-  public TemporaryFolder temp = new TemporaryFolder();
+public abstract class TestSparkReaderDeletes extends DeletesReadTest {
 
-  @Before
-  public void createTable() throws IOException {
-    this.table = catalog.createTable(TableIdentifier.of("default", "table"), SCHEMA);
-    TableOperations ops = ((BaseTable) table).operations();
-    TableMetadata meta = ops.current();
-    ops.commit(meta, meta.upgradeToFormatVersion(2));
-
-    this.records = Lists.newArrayList();
+  private static TestHiveMetastore metastore = null;
+  protected static SparkSession spark = null;
+  protected static HiveCatalog catalog = null;
 
-    // records all use IDs that are in bucket id_bucket=0
-    GenericRecord record = GenericRecord.create(table.schema());
-    records.add(record.copy("id", 29, "data", "a"));
-    records.add(record.copy("id", 43, "data", "b"));
-    records.add(record.copy("id", 61, "data", "c"));
-    records.add(record.copy("id", 89, "data", "d"));
-    records.add(record.copy("id", 100, "data", "e"));
-    records.add(record.copy("id", 121, "data", "f"));
-    records.add(record.copy("id", 122, "data", "g"));
+  @BeforeClass
+  public static void startMetastoreAndSpark() {
+    metastore = new TestHiveMetastore();
+    metastore.start();
+    HiveConf hiveConf = metastore.hiveConf();
 
-    this.dataFile = FileHelpers.writeDataFile(table, Files.localOutput(temp.newFile()), Row.of(0), records);
+    spark = SparkSession.builder()
+        .master("local[2]")
+        .config(SQLConf.PARTITION_OVERWRITE_MODE().key(), "dynamic")
+        .config("spark.hadoop." + METASTOREURIS.varname, hiveConf.get(METASTOREURIS.varname))
+        .enableHiveSupport()
+        .getOrCreate();
 
-    table.newAppend()
-        .appendFile(dataFile)
-        .commit();
-  }
+    catalog = new HiveCatalog(spark.sessionState().newHadoopConf());
 
-  @After
-  public void dropTable() {
-    catalog.dropTable(TableIdentifier.of("default", "table"));
+    try {
+      catalog.createNamespace(Namespace.of("default"));
+    } catch (AlreadyExistsException ignored) {
+      // the default namespace already exists. ignore the create error
+    }
   }
 
-  @Test
-  public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
+  @AfterClass
+  public static void stopMetastoreAndSpark() {
+    catalog.close();
+    catalog = null;
+    metastore.stop();
+    metastore = null;
+    spark.stop();
+    spark = null;
   }
 
-  @Test
-  public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
-    StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testPositionDeletes() throws IOException {
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 0L), // id = 29
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 6L) // id = 122
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMixedPositionAndEqualityDeletes() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 5L) // id = 121
-    );
+  @Before
+  public void prepareData() throws IOException {
+    this.table = createTable("table", SCHEMA, SPEC);

Review comment:
       Make sense, Updated




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496458278



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestMrReadDeletes.java
##########
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.mr;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.BaseTable;
+import org.apache.iceberg.FileFormat;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableMetadata;
+import org.apache.iceberg.TableOperations;
+import org.apache.iceberg.data.DeletesReadTest;
+import org.apache.iceberg.data.Record;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.util.StructLikeSet;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestMrReadDeletes extends DeletesReadTest {
+  private TestHelper helper;
+  private InputFormatConfig.ConfigBuilder builder;
+  private Configuration conf;
+
+  // parametrized variables
+  private final TestIcebergInputFormats.TestInputFormat.Factory<Record> testInputFormat;
+  private final FileFormat fileFormat;
+
+  @Parameterized.Parameters
+  public static Object[][] parameters() {
+    return new Object[][] {
+        new Object[] { "IcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "IcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "IcebergInputFormat", FileFormat.ORC },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.PARQUET },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.AVRO },
+        new Object[] { "MapredIcebergInputFormat", FileFormat.ORC },
+    };
+  }
+
+  @Override
+  public Table createTable(String name, Schema schema, PartitionSpec spec) throws IOException {
+    Table table;
+    conf = new Configuration();

Review comment:
       Hmm, it should be allocated once. I moved allocation to class level.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496365105



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -129,7 +133,7 @@
           // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
           checkResiduals(task);
         }
-        splits.add(new IcebergSplit(conf, task));
+        splits.add(new IcebergSplit(conf, task, table.io(), table.encryption()));

Review comment:
       Thank you @holdenk @rdblue. I updated this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493954344



##########
File path: spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java
##########
@@ -93,212 +102,4 @@ public void createTable() throws IOException {
   public void dropTable() {
     catalog.dropTable(TableIdentifier.of("default", "table"));
   }
-
-  @Test
-  public void testEqualityDeletes() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeletesWithRequiredEqColumn() throws IOException {
-    Schema deleteRowSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(deleteRowSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = selectColumns(rowSetWithoutIds(29, 89, 122), "id");
-    StructLikeSet actual = rowSet(table, "id"); // data is added by the reader to apply the eq deletes
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testPositionDeletes() throws IOException {
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 0L), // id = 29
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 6L) // id = 122
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMixedPositionAndEqualityDeletes() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    List<Pair<CharSequence, Long>> deletes = Lists.newArrayList(
-        Pair.of(dataFile.path(), 3L), // id = 89
-        Pair.of(dataFile.path(), 5L) // id = 121
-    );
-
-    DeleteFile posDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), deletes);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .addDeletes(posDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testMultipleEqualityDeleteSchemas() throws IOException {
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", "a"), // id = 29
-        dataDelete.copy("data", "d"), // id = 89
-        dataDelete.copy("data", "g") // id = 122
-    );
-
-    DeleteFile dataEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    Schema idSchema = table.schema().select("id");
-    Record idDelete = GenericRecord.create(idSchema);
-    List<Record> idDeletes = Lists.newArrayList(
-        idDelete.copy("id", 121), // id = 121
-        idDelete.copy("id", 29) // id = 29
-    );
-
-    DeleteFile idEqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), idDeletes, idSchema);
-
-    table.newRowDelta()
-        .addDeletes(dataEqDeletes)
-        .addDeletes(idEqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(29, 89, 121, 122);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  @Test
-  public void testEqualityDeleteByNull() throws IOException {
-    // data is required in the test table; make it optional for this test
-    table.updateSchema()
-        .makeColumnOptional("data")
-        .commit();
-
-    // add a new data file with a record where data is null
-    Record record = GenericRecord.create(table.schema());
-    DataFile dataFileWithNull = FileHelpers.writeDataFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0),
-        Lists.newArrayList(record.copy("id", 131, "data", null)));
-
-    table.newAppend()
-        .appendFile(dataFileWithNull)
-        .commit();
-
-    // delete where data is null
-    Schema dataSchema = table.schema().select("data");
-    Record dataDelete = GenericRecord.create(dataSchema);
-    List<Record> dataDeletes = Lists.newArrayList(
-        dataDelete.copy("data", null) // id = 131
-    );
-
-    DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
-
-    table.newRowDelta()
-        .addDeletes(eqDeletes)
-        .commit();
-
-    StructLikeSet expected = rowSetWithoutIds(131);
-    StructLikeSet actual = rowSet(table);
-
-    Assert.assertEquals("Table should contain expected rows", expected, actual);
-  }
-
-  private static StructLikeSet rowSet(Table table) {
-    return rowSet(table, "*");
-  }
-
-  private static StructLikeSet rowSet(Table table, String... columns) {

Review comment:
       This method is what reads the rows from the table using Spark. Deleting this method and using the one in `DeletesReadTest` makes this test suite use the exact same read path as the generics -- `IcebergGenerics`.
   
   You can probably make this method abstract and implement it in both classes to get around this. You'll also need to implement a read using the input format or Hive runner to test the Hive code.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r493951523



##########
File path: data/src/test/java/org/apache/iceberg/data/GenericReaderDeletesTest.java
##########
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.data;
+
+import java.io.File;
+import java.io.IOException;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.TestTables;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.Before;
+
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class GenericReaderDeletesTest extends DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  @Before
+  public void writeTestDataFile() throws IOException {
+    File tableDir = temp.newFolder();
+    tableDir.delete();
+    this.table = TestTables.create(tableDir, "test", SCHEMA, SPEC, 2);

Review comment:
       I think a better way to break down the class would be to have an abstract `Table createTable(String name, Schema, Spec)` method. Then the `table` and `dataFile` fields don't need to be shared. I also don't think that there is a need to make `records` public either.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497138463



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -80,21 +95,37 @@ public void testEqualityDeletes() throws IOException {
     );
 
     DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
+        testTable, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, deleteRowSchema);
 
-    table.newRowDelta()
+    testTable.newRowDelta()
         .addDeletes(eqDeletes)
         .commit();
 
     StructLikeSet expected = rowSetWithoutIds(29, 89, 122);
-    StructLikeSet actual = rowSet(table);
+    StructLikeSet actual = rowSet(testTable);
 
     Assert.assertEquals("Table should contain expected rows", expected, actual);
   }
 
+  protected void generateTestData() throws IOException {

Review comment:
       Okay, make sense to me. Moved to `@Before`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r497141257



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java
##########
@@ -370,7 +370,7 @@ public void testCustomCatalog() throws IOException {
     testInputFormat.create(builder.conf()).validate(expectedRecords);
   }
 
-  private abstract static class TestInputFormat<T> {
+  public abstract static class TestInputFormat<T> {

Review comment:
       This is because the `testInputFormat` is type of `TestIcebergInputFormats.TestInputFormat.Factory<Record>` which needs to access the `TestInputFormat`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496889482



##########
File path: mr/src/test/java/org/apache/iceberg/mr/TestIcebergInputFormats.java
##########
@@ -370,7 +370,7 @@ public void testCustomCatalog() throws IOException {
     testInputFormat.create(builder.conf()).validate(expectedRecords);
   }
 
-  private abstract static class TestInputFormat<T> {
+  public abstract static class TestInputFormat<T> {

Review comment:
       Why was this changed to `public`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#issuecomment-703761919


   Merged. Thanks, @chenjunjiedada!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r496278848



##########
File path: mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java
##########
@@ -248,6 +258,22 @@ public void close() throws IOException {
       return iterable;
     }
 
+    @SuppressWarnings("unchecked")
+    private CloseableIterable<T> open(FileScanTask currentTask, Schema readSchema) {
+      switch (inMemoryDataModel) {
+        case PIG:
+        case HIVE:
+          // TODO implement value readers for Pig and Hive
+          throw new UnsupportedOperationException("Avro support not yet supported for Pig and Hive");

Review comment:
       This error message is for Avro, but I think it should be that Pig and Hive are not supported for any format.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r498383751



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -25,31 +25,50 @@
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.Files;
+import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Table;
-import org.apache.iceberg.TableTestBase;
 import org.apache.iceberg.TestHelpers.Row;
-import org.apache.iceberg.io.CloseableIterable;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Sets;
+import org.apache.iceberg.types.Types;
 import org.apache.iceberg.util.ArrayUtil;
 import org.apache.iceberg.util.Pair;
 import org.apache.iceberg.util.StructLikeSet;
 import org.apache.iceberg.util.StructProjection;
+import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
+import org.junit.Rule;
 import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
 
-public class TestGenericReaderDeletes extends TableTestBase {
-  public TestGenericReaderDeletes() {
-    super(2 /* format v2 with delete files */);
-  }
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public abstract class DeletesReadTest {
+  // Schema passed to create tables
+  public static final Schema SCHEMA = new Schema(
+      required(1, "id", Types.IntegerType.get()),
+      required(2, "data", Types.StringType.get())
+  );
+
+  // Partition spec used to create tables
+  public static final PartitionSpec SPEC = PartitionSpec.builderFor(SCHEMA)
+      .bucket("data", 16)
+      .build();
+
+  protected final String testTableName = "test";

Review comment:
       I think this should be private. If it is needed by subclases, it should be passed into methods, not shared. I think this is only used by Spark, so it should be easy to fix.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chenjunjiedada commented on a change in pull request #1497: MR: apply row-level delete files when reading

Posted by GitBox <gi...@apache.org>.

chenjunjiedada commented on a change in pull request #1497:
URL: https://github.com/apache/iceberg/pull/1497#discussion_r494016066



##########
File path: data/src/test/java/org/apache/iceberg/data/DeletesReadTest.java
##########
@@ -148,15 +144,15 @@ public void testMixedPositionAndEqualityDeletes() throws IOException {
     );
 
     DeleteFile eqDeletes = FileHelpers.writeDeleteFile(
-        table, Files.localOutput(temp.newFile()), Row.of(0), dataDeletes, dataSchema);
+        table, Files.localOutput(temp.newFile()), TestHelpers.Row.of(0), dataDeletes, dataSchema);

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org