You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/01/26 10:28:57 UTC

[GitHub] [incubator-iceberg] maqroll opened a new pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

maqroll opened a new pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750
 
 
   Added time-travel methods (asOfTime, useSnapshot) in IcebergGenerics.
   
   I moved TableScan initialization to IcebergGenerics constructor so that time travel settings validation doesn't get deferred to build() call (as in TableScan).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371372669
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -215,28 +275,90 @@ public void testFullScan() {
     Iterable<Record> results = IcebergGenerics.read(sharedTable).build();
 
     Set<Record> expected = Sets.newHashSet();
-    expected.addAll(file1Records);
-    expected.addAll(file2Records);
-    expected.addAll(file3Records);
+    expected.addAll(file1SecondSnapshotRecords);
+    expected.addAll(file2SecondSnapshotRecords);
+    expected.addAll(file3SecondSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
+    Assert.assertNotNull(records.stream().findFirst().get().getField("data"));
+  }
+
+  @Test
+  public void testUnknownSnapshotId() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find snapshot with ID "));
 
 Review comment:
   Instead of using `ExpectedException`, we use `AssertHelpers.assertThrows`. That also supports matching the contents of the thrown exception's message.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue merged pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371491281
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -56,27 +61,33 @@ public ScanBuilder reuseContainers() {
     }
 
     public ScanBuilder where(Expression rowFilter) {
-      this.where = Expressions.and(where, rowFilter);
+      this.tableScan = this.tableScan.filter(Expressions.and(defaultWhere, rowFilter));
 
 Review comment:
   Agreed. I just don't think we should have two sets of scan defaults.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on issue #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#issuecomment-579859948
 
 
   +1
   
   Thanks for fixing this, @maqroll! Also, nice job fixing the review points. It is really helpful when you fix all of the instances of something (e.g. usage of `this.`) without needing to point out each one individually. Thanks!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371373886
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -56,27 +61,33 @@ public ScanBuilder reuseContainers() {
     }
 
     public ScanBuilder where(Expression rowFilter) {
-      this.where = Expressions.and(where, rowFilter);
+      this.tableScan = this.tableScan.filter(Expressions.and(defaultWhere, rowFilter));
 
 Review comment:
   The `TableScan` API is a refinement API, like Spark's data frames. When you filter, you get a new scan that has the new filter applied with the existing scan's filters. So there's no need to use `Expressions.and`. In fact, using `and` with `true` and another filter will just return the filter.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371374679
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -56,27 +61,33 @@ public ScanBuilder reuseContainers() {
     }
 
     public ScanBuilder where(Expression rowFilter) {
-      this.where = Expressions.and(where, rowFilter);
+      this.tableScan = this.tableScan.filter(Expressions.and(defaultWhere, rowFilter));
       return this;
     }
 
     public ScanBuilder caseInsensitive() {
-      this.caseSensitive = false;
+      this.tableScan = this.tableScan.caseSensitive(false);
       return this;
     }
 
     public ScanBuilder select(String... selectedColumns) {
-      this.columns = ImmutableList.copyOf(selectedColumns);
+      this.tableScan = this.tableScan.select(ImmutableList.copyOf(selectedColumns));
+      return this;
+    }
+
+    public ScanBuilder useSnapshot(long scanSnapshotId) {
+      this.tableScan = this.tableScan.useSnapshot(scanSnapshotId);
+      return this;
+    }
+
+    public ScanBuilder asOfTime(long scanTimestampMillis) {
+      this.tableScan = this.tableScan.asOfTime(scanTimestampMillis);
       return this;
     }
 
     public Iterable<Record> build() {
       return new TableScanIterable(
-        table
-          .newScan()
-          .filter(where)
-          .caseSensitive(caseSensitive)
-          .select(columns),
+        this.tableScan,
 
 Review comment:
   We only use `this.` when assigning to an instance field, so that it is clear from that line that the assignment is not to a local variable. We don't use `this.` when accessing the variable. Can you remove those?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371375805
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -114,53 +122,105 @@ public void createTables() throws IOException {
 
     Record record = GenericRecord.create(SCHEMA);
 
-    this.file1Records = Lists.newArrayList(
+    this.file1FirstSnapshotRecords = Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 4L, "data", "obscure")),
 
 Review comment:
   Continuation indents should be 2 indents = 4 spaces, not 8 spaces.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r372510717
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -279,31 +323,107 @@ public void testProjectWithMissingFilterColumn() {
         Sets.newHashSet(transform(results, record -> record.getField("data").toString())));
   }
 
-  private InputFile writeFile(String location, String filename, List<Record> records) throws IOException {
+  @Test
+  public void testUseSnapshot() throws IOException {
+    overwriteExistingData();
+    Iterable<Record> results = IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* first snapshot */ sharedTable.history().get(1).snapshotId())
+        .build();
+
+    Set<Record> expected = Sets.newHashSet();
+    expected.addAll(file1FirstSnapshotRecords);
+    expected.addAll(file2FirstSnapshotRecords);
+    expected.addAll(file3FirstSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(Iterables.get(records, 0).getField("id"));
+    Assert.assertNotNull(Iterables.get(records, 0).getField("data"));
+  }
+
+  @Test
+  public void testAsOfTime() throws IOException {
+    overwriteExistingData();
+    Iterable<Record> results = IcebergGenerics.read(sharedTable)
+        .asOfTime(/* timestamp first snapshot */ sharedTable.history().get(2).timestampMillis())
+        .build();
+
+    Set<Record> expected = Sets.newHashSet();
+    expected.addAll(file1SecondSnapshotRecords);
+    expected.addAll(file2SecondSnapshotRecords);
+    expected.addAll(file3SecondSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(Iterables.get(records, 0).getField("id"));
+    Assert.assertNotNull(Iterables.get(records, 0).getField("data"));
+  }
+
+  @Test
+  public void testUnknownSnapshotId() {
+    Long minSnapshotId = sharedTable.history().stream().map(h -> h.snapshotId()).min(Long::compareTo).get();
+
+    IcebergGenerics.ScanBuilder scanBuilder = IcebergGenerics.read(sharedTable);
+
+    AssertHelpers.assertThrows("Should fail on unknown snapshot id",
+        IllegalArgumentException.class,
+        "Cannot find snapshot with ID ",
+        () -> scanBuilder.useSnapshot(/* unknown snapshot id */ minSnapshotId - 1));
+  }
+
+  @Test
+  public void testAsOfTimeOlderThanFirstSnapshot() {
+    IcebergGenerics.ScanBuilder scanBuilder = IcebergGenerics.read(sharedTable);
+
+    AssertHelpers.assertThrows("Should fail on timestamp sooner than first write",
+        IllegalArgumentException.class,
+        "Cannot find a snapshot older than ",
+        () -> scanBuilder.asOfTime(/* older than first snapshot */ sharedTable.history().get(0).timestampMillis() - 1));
+  }
+
+  private DataFile writeFile(String location, String filename, List<Record> records) throws IOException {
     Path path = new Path(location, filename);
     FileFormat fileFormat = FileFormat.fromFileName(filename);
     Preconditions.checkNotNull(fileFormat, "Cannot determine format for file: %s", filename);
     switch (fileFormat) {
       case AVRO:
-        try (FileAppender<Record> appender = Avro.write(fromPath(path, CONF))
+        FileAppender avroAppender = Avro.write(fromPath(path, CONF))
             .schema(SCHEMA)
             .createWriterFunc(DataWriter::create)
             .named(fileFormat.name())
-            .build()) {
-          appender.addAll(records);
+            .build();
+        try {
+          avroAppender.addAll(records);
+        } finally {
+          avroAppender.close();
         }
 
-        return HadoopInputFile.fromPath(path, CONF);
+        return DataFiles.builder(PartitionSpec.unpartitioned())
+            .withInputFile(HadoopInputFile.fromPath(path, CONF))
+            .withMetrics(avroAppender.metrics())
+            .build();
 
       case PARQUET:
-        try (FileAppender<Record> appender = Parquet.write(fromPath(path, CONF))
+        FileAppender<Record> orcAppender = Parquet.write(fromPath(path, CONF))
 
 Review comment:
   Typo: should be `parquetAppender`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371424034
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -56,27 +61,33 @@ public ScanBuilder reuseContainers() {
     }
 
     public ScanBuilder where(Expression rowFilter) {
-      this.where = Expressions.and(where, rowFilter);
+      this.tableScan = this.tableScan.filter(Expressions.and(defaultWhere, rowFilter));
 
 Review comment:
   Seems like we can either keeping refining the `tableScan` obj in each of the builder methods in which case we might not need to have instance fields like `where`, `caseSensitive` etc and we should delete them where possible. Or we create the `tablescan` once in the `build` method in which case we will need to keep our own instance fields, but then we should remove the tablescan refine from each fo the build methods. Either approach is fine by me.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371381016
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -215,28 +275,90 @@ public void testFullScan() {
     Iterable<Record> results = IcebergGenerics.read(sharedTable).build();
 
     Set<Record> expected = Sets.newHashSet();
-    expected.addAll(file1Records);
-    expected.addAll(file2Records);
-    expected.addAll(file3Records);
+    expected.addAll(file1SecondSnapshotRecords);
+    expected.addAll(file2SecondSnapshotRecords);
+    expected.addAll(file3SecondSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
+    Assert.assertNotNull(records.stream().findFirst().get().getField("data"));
+  }
+
+  @Test
+  public void testUnknownSnapshotId() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find snapshot with ID "));
+
+    Long minSnapshotId = sharedTable.history().stream().map(h -> h.snapshotId()).min(Long::compareTo).get();
+
+    IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* unknown snapshot id */ minSnapshotId - 1);
+  }
+
+  @Test
+  public void testAsOfTimeOlderThanFirstSnapshot() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find a snapshot older than "));
+
+    IcebergGenerics.read(sharedTable)
+        .asOfTime(/* older than first snapshot */ sharedTable.history().get(0).timestampMillis() - 1);
+  }
+
+  @Test
+  public void testUseSnapshot() {
 
 Review comment:
   The table setup is done in a `@Before` method, so the table is recreated for each test case. Since the first snapshot is only used by this test case and `testAsOfTime`, you might consider moving the overwrite into a helper method. That would require fewer changes to existing test methods.
   
   If you did that, then this would start with `overwriteExistingData()` and then you could test that you can read the old data or the overwrite snapshot.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371377849
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -114,53 +122,105 @@ public void createTables() throws IOException {
 
     Record record = GenericRecord.create(SCHEMA);
 
-    this.file1Records = Lists.newArrayList(
+    this.file1FirstSnapshotRecords = Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 4L, "data", "obscure")),
+            record.copy(ImmutableMap.of("id", 5L, "data", "secure")),
+            record.copy(ImmutableMap.of("id", 6L, "data", "fetta"))
+    );
+    InputFile file11 = writeFile(sharedTableLocation, format.addExtension("file-11"), file1FirstSnapshotRecords);
+
+    this.file2FirstSnapshotRecords = Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 14L, "data", "radical")),
+            record.copy(ImmutableMap.of("id", 15L, "data", "collocation")),
+            record.copy(ImmutableMap.of("id", 16L, "data", "book"))
+    );
+    InputFile file21 = writeFile(sharedTableLocation, format.addExtension("file-21"), file2FirstSnapshotRecords);
+
+    this.file3FirstSnapshotRecords = Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 24L, "data", "cloud")),
+            record.copy(ImmutableMap.of("id", 25L, "data", "zen")),
+            record.copy(ImmutableMap.of("id", 26L, "data", "sky"))
+    );
+    InputFile file31 = writeFile(sharedTableLocation, format.addExtension("file-31"), file3FirstSnapshotRecords);
+
+    this.file1SecondSnapshotRecords = Lists.newArrayList(
         record.copy(ImmutableMap.of("id", 0L, "data", "clarification")),
         record.copy(ImmutableMap.of("id", 1L, "data", "risky")),
         record.copy(ImmutableMap.of("id", 2L, "data", "falafel"))
     );
-    InputFile file1 = writeFile(sharedTableLocation, format.addExtension("file-1"), file1Records);
+    InputFile file12 = writeFile(sharedTableLocation, format.addExtension("file-12"), file1SecondSnapshotRecords);
 
     Record nullData = record.copy();
     nullData.setField("id", 11L);
     nullData.setField("data", null);
 
-    this.file2Records = Lists.newArrayList(
+    this.file2SecondSnapshotRecords = Lists.newArrayList(
         record.copy(ImmutableMap.of("id", 10L, "data", "clammy")),
         record.copy(ImmutableMap.of("id", 11L, "data", "evacuate")),
         record.copy(ImmutableMap.of("id", 12L, "data", "tissue"))
     );
-    InputFile file2 = writeFile(sharedTableLocation, format.addExtension("file-2"), file2Records);
+    InputFile file22 = writeFile(sharedTableLocation, format.addExtension("file-22"), file2SecondSnapshotRecords);
 
-    this.file3Records = Lists.newArrayList(
+    this.file3SecondSnapshotRecords = Lists.newArrayList(
         record.copy(ImmutableMap.of("id", 20L, "data", "ocean")),
         record.copy(ImmutableMap.of("id", 21L, "data", "holistic")),
         record.copy(ImmutableMap.of("id", 22L, "data", "preventative"))
     );
-    InputFile file3 = writeFile(sharedTableLocation, format.addExtension("file-3"), file3Records);
+    InputFile file32 = writeFile(sharedTableLocation, format.addExtension("file-32"), file3SecondSnapshotRecords);
 
     // commit the test data
     sharedTable.newAppend()
         .appendFile(DataFiles.builder(PartitionSpec.unpartitioned())
-            .withInputFile(file1)
+            .withInputFile(file11)
+            .withMetrics(new Metrics(3L,
 
 Review comment:
   Instead of hard-coding the metrics for each data file, I think that this PR should update `writeFile` to return a `DataFile` instance. That method can call `appender.metrics()` to get the metrics when creating the `DataFile`, which would ensure they are always correct. It would also make this setup method much shorter.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371374939
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -215,28 +275,90 @@ public void testFullScan() {
     Iterable<Record> results = IcebergGenerics.read(sharedTable).build();
 
     Set<Record> expected = Sets.newHashSet();
-    expected.addAll(file1Records);
-    expected.addAll(file2Records);
-    expected.addAll(file3Records);
+    expected.addAll(file1SecondSnapshotRecords);
+    expected.addAll(file2SecondSnapshotRecords);
+    expected.addAll(file3SecondSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
+    Assert.assertNotNull(records.stream().findFirst().get().getField("data"));
+  }
+
+  @Test
+  public void testUnknownSnapshotId() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find snapshot with ID "));
+
+    Long minSnapshotId = sharedTable.history().stream().map(h -> h.snapshotId()).min(Long::compareTo).get();
+
+    IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* unknown snapshot id */ minSnapshotId - 1);
+  }
+
+  @Test
+  public void testAsOfTimeOlderThanFirstSnapshot() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find a snapshot older than "));
+
+    IcebergGenerics.read(sharedTable)
+        .asOfTime(/* older than first snapshot */ sharedTable.history().get(0).timestampMillis() - 1);
+  }
+
+  @Test
+  public void testUseSnapshot() {
+    Iterable<Record> results = IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* first snapshot */ sharedTable.history().get(0).snapshotId())
+        .build();
+
+    Set<Record> expected = Sets.newHashSet();
+    expected.addAll(file1FirstSnapshotRecords);
+    expected.addAll(file2FirstSnapshotRecords);
+    expected.addAll(file3FirstSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
+    Assert.assertNotNull(records.stream().findFirst().get().getField("data"));
+  }
+
+  @Test
+  public void testAsOfTime() {
+    Iterable<Record> results = IcebergGenerics.read(sharedTable)
+        .asOfTime(/* timestamp first snapshot */ sharedTable.history().get(0).timestampMillis())
+        .build();
+
+    Set<Record> expected = Sets.newHashSet();
+    expected.addAll(file1FirstSnapshotRecords);
+    expected.addAll(file2FirstSnapshotRecords);
+    expected.addAll(file3FirstSnapshotRecords);
 
     Set<Record> records = Sets.newHashSet(results);
     Assert.assertEquals("Should produce correct number of records",
         expected.size(), records.size());
     Assert.assertEquals("Random record set should match",
         Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
 
 Review comment:
   Minor: using Iterables rather than streams is less verbose: `Iterables.get(set, 0).getField("id")`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371424034
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -56,27 +61,33 @@ public ScanBuilder reuseContainers() {
     }
 
     public ScanBuilder where(Expression rowFilter) {
-      this.where = Expressions.and(where, rowFilter);
+      this.tableScan = this.tableScan.filter(Expressions.and(defaultWhere, rowFilter));
 
 Review comment:
   Seems like we can either keeping refining the `tableScan` obj in each of the builder methods in which case we might not need to have instance fields like `where`, `caseSensitive` etc and we should delete them where possible. Or we create the `tablescan` once in the `build()` method in which case we will need to keep our own instance fields, but then we should remove the tablescan refine from each fo the build methods. Either approach is fine by me.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371373953
 
 

 ##########
 File path: data/src/main/java/org/apache/iceberg/data/IcebergGenerics.java
 ##########
 @@ -41,13 +42,17 @@ public static ScanBuilder read(Table table) {
 
   public static class ScanBuilder {
     private final Table table;
-    private Expression where = Expressions.alwaysTrue();
-    private List<String> columns = ImmutableList.of("*");
+    private TableScan tableScan;
+    private final Expression defaultWhere = Expressions.alwaysTrue();
+    private final List<String> defaultColumns = ImmutableList.of("*");
     private boolean reuseContainers = false;
-    private boolean caseSensitive = true;
+    private final boolean defaultCaseSensitive = true;
 
     public ScanBuilder(Table table) {
       this.table = table;
+      this.tableScan = table.newScan()
+          .select(this.defaultColumns)
+          .caseSensitive(this.defaultCaseSensitive);
 
 Review comment:
   Instead of having defaults here as well as in the table scan, this should just create a new scan and rely on its defaults.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime, useSnapshot) in IcebergGenerics

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #750: Add time-travel methods (asOfTime,useSnapshot) in IcebergGenerics
URL: https://github.com/apache/incubator-iceberg/pull/750#discussion_r371379140
 
 

 ##########
 File path: data/src/test/java/org/apache/iceberg/data/TestLocalScan.java
 ##########
 @@ -215,28 +275,90 @@ public void testFullScan() {
     Iterable<Record> results = IcebergGenerics.read(sharedTable).build();
 
     Set<Record> expected = Sets.newHashSet();
-    expected.addAll(file1Records);
-    expected.addAll(file2Records);
-    expected.addAll(file3Records);
+    expected.addAll(file1SecondSnapshotRecords);
+    expected.addAll(file2SecondSnapshotRecords);
+    expected.addAll(file3SecondSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
+        Sets.newHashSet(expected), records);
+    Assert.assertNotNull(records.stream().findFirst().get().getField("id"));
+    Assert.assertNotNull(records.stream().findFirst().get().getField("data"));
+  }
+
+  @Test
+  public void testUnknownSnapshotId() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find snapshot with ID "));
+
+    Long minSnapshotId = sharedTable.history().stream().map(h -> h.snapshotId()).min(Long::compareTo).get();
+
+    IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* unknown snapshot id */ minSnapshotId - 1);
+  }
+
+  @Test
+  public void testAsOfTimeOlderThanFirstSnapshot() {
+    exceptionRule.expect(IllegalArgumentException.class);
+    exceptionRule.expectMessage(startsWith("Cannot find a snapshot older than "));
+
+    IcebergGenerics.read(sharedTable)
+        .asOfTime(/* older than first snapshot */ sharedTable.history().get(0).timestampMillis() - 1);
+  }
+
+  @Test
+  public void testUseSnapshot() {
+    Iterable<Record> results = IcebergGenerics.read(sharedTable)
+        .useSnapshot(/* first snapshot */ sharedTable.history().get(0).snapshotId())
+        .build();
+
+    Set<Record> expected = Sets.newHashSet();
+    expected.addAll(file1FirstSnapshotRecords);
+    expected.addAll(file2FirstSnapshotRecords);
+    expected.addAll(file3FirstSnapshotRecords);
+
+    Set<Record> records = Sets.newHashSet(results);
+    Assert.assertEquals("Should produce correct number of records",
+        expected.size(), records.size());
+    Assert.assertEquals("Random record set should match",
 
 Review comment:
   Looks like a copy/paste error. This isn't a random record set. Can you update this to "Record set should match"?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org