You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/23 21:50:42 UTC

[GitHub] [iceberg] rdblue opened a new pull request #1981: Fix date and timestamp transforms

rdblue opened a new pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981


   This fixes incorrect values produced by date and time transforms and adds tests.
   
   This also updates date and timestamp transform projection methods to include incorrectly transformed values.
   
   Fixes #1680.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r552939902



##########
File path: api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java
##########
@@ -254,4 +257,145 @@ private ProjectionUtil() {
     return predicate(predicate.op(), fieldName,
         Iterables.transform(predicate.asSetPredicate().literalSet(), transform::apply));
   }
+
+  /**
+   * Fixes an inclusive projection to account for incorrectly transformed values.
+   * <p>
+   * A bug in 0.10.0 and earlier caused negative values to be incorrectly transformed by date and timestamp transforms
+   * to 1 larger than the correct value. For example, day(1969-12-31 10:00:00) produced 0 instead of -1. To read data
+   * written by versions with this bug, this method adjusts the inclusive projection. The current inclusive projection
+   * is correct, so this modifies the "correct" projection when needed. For example, < day(1969-12-31 10:00:00) will
+   * produce <= -1 (= 1969-12-31) and is adjusted to <= 0 (= 1969-01-01) because the incorrect transformed value was 0.
+   */
+  static UnboundPredicate<Integer> fixInclusiveTimeProjection(UnboundPredicate<Integer> projected) {
+    if (projected == null) {
+      return projected;
+    }
+
+    // adjust the predicate for values that were 1 larger than the correct transformed value
+    switch (projected.op()) {
+      case LT:
+        if (projected.literal().value() < 0) {
+          return Expressions.lessThan(projected.term(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case LT_EQ:
+        if (projected.literal().value() < 0) {
+          return Expressions.lessThanOrEqual(projected.term(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case GT:
+      case GT_EQ:
+        // incorrect projected values are already greater than the bound for GT, GT_EQ
+        return projected;
+
+      case EQ:
+        if (projected.literal().value() < 0) {
+          // match either the incorrect value (projectedValue + 1) or the correct value (projectedValue)
+          return Expressions.in(projected.term(), projected.literal().value(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case IN:
+        Set<Integer> fixedSet = Sets.newHashSet();
+        boolean hasNegativeValue = false;
+        for (Literal<Integer> lit : projected.literals()) {
+          Integer value = lit.value();
+          fixedSet.add(value);
+          if (value < 0) {
+            hasNegativeValue = true;
+            fixedSet.add(value + 1);
+          }
+        }
+
+        if (hasNegativeValue) {
+          return Expressions.in(projected.term(), fixedSet);

Review comment:
       why not just always return this new expression? We build up fixedSet no matter what?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r549826545



##########
File path: api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java
##########
@@ -254,4 +257,124 @@ private ProjectionUtil() {
     return predicate(predicate.op(), fieldName,
         Iterables.transform(predicate.asSetPredicate().literalSet(), transform::apply));
   }
+
+  static UnboundPredicate<Integer> fixInclusiveTimeProjection(UnboundPredicate<Integer> projected) {
+    if (projected == null) {
+      return projected;
+    }
+
+    // adjust the predicate for values that were 1 larger than the correct transformed value

Review comment:
       I added docs to explain what "correct" and "incorrect" mean.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-757058140


   > Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?
   
   Yes. Older readers will still have the bug. I think we can only make a backward-compatible fix for this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-751026580


   > To make sure I understand correctly, changes in projection methods ensure that both behaviors before and after this fix will be accounted for by the projection, so that we might not need to have separate implementations for format v1 versus v2, with a slight penalty that in v2 we may scan more data than we have to?
   
   Yes, this fixes the transforms and ensures that predicate projection includes the partitions that were written with bad values. That means that we won't need different implementations for v2, but it also means that we can avoid scanning the extra partitions in the future. Because this is fixed before v2, we can ensure that all v2 tables have been fixed. So if a table is created as v2, we should be able to know that no older writers with the bug have written to the table. The only case where a v2 table would have bad metadata is when a v1 table has been converted. We should add a flag to signal that the table was converted from v1 or one that signals it was created as v2 that allows us to skip the extra partitions.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r552936378



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Timestamps.java
##########
@@ -52,12 +52,23 @@ public Integer apply(Long timestampMicros) {
       return null;
     }
 
-    // discards fractional seconds, not needed for calculation
-    OffsetDateTime timestamp = Instant
-        .ofEpochSecond(timestampMicros / 1_000_000)
-        .atOffset(ZoneOffset.UTC);
-
-    return (int) granularity.between(EPOCH, timestamp);
+    if (timestampMicros >= 0) {
+      OffsetDateTime timestamp = Instant
+          .ofEpochSecond(
+              Math.floorDiv(timestampMicros, 1_000_000),
+              Math.floorMod(timestampMicros, 1_000_000) * 1000)

Review comment:
       nit: extract `1_000_000` and `1000` to be a constant with meaningful name or use predefined ones in java library.

##########
File path: api/src/main/java/org/apache/iceberg/transforms/Dates.java
##########
@@ -99,11 +107,24 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 
     if (pred.isUnaryPredicate()) {
       return Expressions.predicate(pred.op(), fieldName);
+
     } else if (pred.isLiteralPredicate()) {
-      return ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      UnboundPredicate<Integer> projected = ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      if (this != DAY) {

Review comment:
       Seems the same logic (`if (condition) {return...} return ...`) appears multiple times, can it be included inside a method, e.g. just embed it in the method defined in `ProjectionUtil`?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks edited a comment on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
danielcweeks edited a comment on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765040351


   ~~So, the best I can follow, this looks good, but there's one issue I'm not clear on.  If we're just expanding the projection to include possible matches (e.g. equality case), doesn't this cause a problem where the partition values are joined in (i.e. not materialized in the data like converted Hive tables)?
   
   I assume we don't have a way to correct for that case since the recorded transform is incorrect and we cannot materialize the correct data to filter later?~~
   
   Nevermind, they can't really have transforms in the partitions anyway.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r548271146



##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestTimestampsProjection.java
##########
@@ -74,22 +74,22 @@ public void assertProjectionStrictValue(PartitionSpec spec, UnboundPredicate<?>
                                           Expression.Operation expectedOp) {
 
     Expression projection = Projections.strict(spec).project(filter);
-    Assert.assertEquals(projection.op(), expectedOp);
+    Assert.assertEquals(expectedOp, projection.op());
   }
 
   public void assertProjectionInclusiveValue(PartitionSpec spec, UnboundPredicate<?> filter,
                                              Expression.Operation expectedOp) {
 
     Expression projection = Projections.inclusive(spec).project(filter);
-    Assert.assertEquals(projection.op(), expectedOp);
+    Assert.assertEquals(expectedOp, projection.op());
   }
 
   public void assertProjectionInclusive(PartitionSpec spec, UnboundPredicate<?> filter,
                                         Expression.Operation expectedOp, String expectedLiteral) {
     Expression projection = Projections.inclusive(spec).project(filter);
     UnboundPredicate<?> predicate = assertAndUnwrapUnbound(projection);
 
-    Assert.assertEquals(predicate.op(), expectedOp);

Review comment:
       More tests need to be added here. I just wanted to get this up for review since I'm going to be out for a couple of days.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] boroknagyz commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
boroknagyz commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-757054106


   Thanks for fixing this, I've just run into this issue with Impala. (And this fix is also a prerequisite for issue #2043)
   
   So the plan is that 0.11+ writers will write the correct transformation values IIUC. Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?
   
   E.g. new writer writes '1969-12-31 23:05:00' with YEAR partition transform (so ts_year=-1), old reader scans for 'ts > 1969-12-31 23:00:00' (so with old transform the projected predicate is ts_year >=0), and the file will be skipped because the partition data has ts_year=-1.
   
   I just want to clarify this, because with Impala I'm planning to write V1 files with the fixed transforms.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r549822454



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Timestamps.java
##########
@@ -52,12 +52,19 @@ public Integer apply(Long timestampMicros) {
       return null;
     }
 
-    // discards fractional seconds, not needed for calculation
-    OffsetDateTime timestamp = Instant
-        .ofEpochSecond(timestampMicros / 1_000_000)
-        .atOffset(ZoneOffset.UTC);
-
-    return (int) granularity.between(EPOCH, timestamp);
+    if (timestampMicros >= 0) {
+      OffsetDateTime timestamp = Instant
+          .ofEpochSecond(Math.floorDiv(timestampMicros, 1_000_000), Math.floorMod(timestampMicros, 1_000_000))
+          .atOffset(ZoneOffset.UTC);
+      return (int) granularity.between(EPOCH, timestamp);
+    } else {
+      // add 1 micro to the value to account for the case where there is exactly 1 unit between the timestamp and epoch
+      // because the result will always be decremented.
+      OffsetDateTime timestamp = Instant
+          .ofEpochSecond(Math.floorDiv(timestampMicros, 1_000_000), Math.floorMod(timestampMicros + 1, 1_000_000))

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765040351


   So, the best I can follow, this looks good, but there's one issue I'm not clear on.  If we're just expanding the projection to include possible matches (e.g. equality case), doesn't this cause a problem where the partition values are joined in (i.e. not materialized in the data like converted Hive tables)?
   
   I assume we don't have a way to correct for that case since the recorded transform is incorrect and we cannot materialize the correct data to filter later?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r553008256



##########
File path: api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java
##########
@@ -254,4 +257,145 @@ private ProjectionUtil() {
     return predicate(predicate.op(), fieldName,
         Iterables.transform(predicate.asSetPredicate().literalSet(), transform::apply));
   }
+
+  /**
+   * Fixes an inclusive projection to account for incorrectly transformed values.
+   * <p>
+   * A bug in 0.10.0 and earlier caused negative values to be incorrectly transformed by date and timestamp transforms
+   * to 1 larger than the correct value. For example, day(1969-12-31 10:00:00) produced 0 instead of -1. To read data
+   * written by versions with this bug, this method adjusts the inclusive projection. The current inclusive projection
+   * is correct, so this modifies the "correct" projection when needed. For example, < day(1969-12-31 10:00:00) will
+   * produce <= -1 (= 1969-12-31) and is adjusted to <= 0 (= 1969-01-01) because the incorrect transformed value was 0.
+   */
+  static UnboundPredicate<Integer> fixInclusiveTimeProjection(UnboundPredicate<Integer> projected) {
+    if (projected == null) {
+      return projected;
+    }
+
+    // adjust the predicate for values that were 1 larger than the correct transformed value
+    switch (projected.op()) {
+      case LT:
+        if (projected.literal().value() < 0) {
+          return Expressions.lessThan(projected.term(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case LT_EQ:
+        if (projected.literal().value() < 0) {
+          return Expressions.lessThanOrEqual(projected.term(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case GT:
+      case GT_EQ:
+        // incorrect projected values are already greater than the bound for GT, GT_EQ
+        return projected;
+
+      case EQ:
+        if (projected.literal().value() < 0) {
+          // match either the incorrect value (projectedValue + 1) or the correct value (projectedValue)
+          return Expressions.in(projected.term(), projected.literal().value(), projected.literal().value() + 1);
+        }
+
+        return projected;
+
+      case IN:
+        Set<Integer> fixedSet = Sets.newHashSet();
+        boolean hasNegativeValue = false;
+        for (Literal<Integer> lit : projected.literals()) {
+          Integer value = lit.value();
+          fixedSet.add(value);
+          if (value < 0) {
+            hasNegativeValue = true;
+            fixedSet.add(value + 1);
+          }
+        }
+
+        if (hasNegativeValue) {
+          return Expressions.in(projected.term(), fixedSet);

Review comment:
       If there is no negative value, then there is no need to fixup the expression and we can return the original. That avoids some object allocation?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks edited a comment on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
danielcweeks edited a comment on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765040351


   ~~So, the best I can follow, this looks good, but there's one issue I'm not clear on.  If we're just expanding the projection to include possible matches (e.g. equality case), doesn't this cause a problem where the partition values are joined in (i.e. not materialized in the data like converted Hive tables)?~~
   
   ~~I assume we don't have a way to correct for that case since the recorded transform is incorrect and we cannot materialize the correct data to filter later?~~
   
   Nevermind, they can't really have transforms in the partitions anyway.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks edited a comment on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
danielcweeks edited a comment on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765040351






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765046766


   Thanks for the reviews, @danielcweeks, @yyanyy, @RussellSpitzer, @jun-he, and @boroknagyz!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-757058140


   > Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?
   
   Yes. Older readers will still have the bug. I think we can only make a backward-compatible fix for this.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r549823791



##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestDatesProjection.java
##########
@@ -142,6 +161,24 @@ public void testMonthStrictUpperBound() {
     assertProjectionStrictValue(spec, in("date", anotherDate, date), Expression.Operation.FALSE);
   }
 
+  @Test
+  public void testNegativeMonthStrictUpperBound() {
+    Integer date = (Integer) Literal.of("1969-12-31").to(TYPE).value();
+    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).month("date").build();
+
+    assertProjectionStrictValue(spec, lessThan("date", date), Expression.Operation.FALSE);
+    assertProjectionStrictValue(spec, lessThanOrEqual("date", date), Expression.Operation.FALSE);
+    assertProjectionStrict(spec, greaterThan("date", date), Expression.Operation.GT, "1969-12");
+    assertProjectionStrict(spec, greaterThanOrEqual("date", date), Expression.Operation.GT, "1969-12");
+    assertProjectionStrict(spec, notEqual("date", date), Expression.Operation.NOT_IN, "[1969-12, 1970-01]");
+    assertProjectionStrictValue(spec, equal("date", date), Expression.Operation.FALSE);
+
+    Integer anotherDate = (Integer) Literal.of("1970-01-01").to(TYPE).value();
+    assertProjectionStrict(spec, notIn("date", date, anotherDate),
+        Expression.Operation.NOT_IN, "[1969-12, 1970-01]");

Review comment:
       I updated the other test to use a lower bound that is not 1970-01-01 and also added tests specifically for epoch values. This still conflicted with the epoch test, so I changed it to use 1969-11-01.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765046766


   Thanks for the reviews, @danielcweeks, @yyanyy, @RussellSpitzer, @jun-he, and @boroknagyz!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-765040351


   So, the best I can follow, this looks good, but there's one issue I'm not clear on.  If we're just expanding the projection to include possible matches (e.g. equality case), doesn't this cause a problem where the partition values are joined in (i.e. not materialized in the data like converted Hive tables)?
   
   I assume we don't have a way to correct for that case since the recorded transform is incorrect and we cannot materialize the correct data to filter later?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r562304565



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Dates.java
##########
@@ -99,11 +107,24 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 
     if (pred.isUnaryPredicate()) {
       return Expressions.predicate(pred.op(), fieldName);
+
     } else if (pred.isLiteralPredicate()) {
-      return ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      UnboundPredicate<Integer> projected = ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      if (this != DAY) {

Review comment:
       There are only 4 instance of this, and they call 2 different fix methods. I don't think it would be worth adding 2 methods just to dedup 4 lines here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r548345460



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Timestamps.java
##########
@@ -52,12 +52,19 @@ public Integer apply(Long timestampMicros) {
       return null;
     }
 
-    // discards fractional seconds, not needed for calculation
-    OffsetDateTime timestamp = Instant
-        .ofEpochSecond(timestampMicros / 1_000_000)
-        .atOffset(ZoneOffset.UTC);
-
-    return (int) granularity.between(EPOCH, timestamp);
+    if (timestampMicros >= 0) {
+      OffsetDateTime timestamp = Instant
+          .ofEpochSecond(Math.floorDiv(timestampMicros, 1_000_000), Math.floorMod(timestampMicros, 1_000_000))
+          .atOffset(ZoneOffset.UTC);
+      return (int) granularity.between(EPOCH, timestamp);
+    } else {
+      // add 1 micro to the value to account for the case where there is exactly 1 unit between the timestamp and epoch
+      // because the result will always be decremented.
+      OffsetDateTime timestamp = Instant
+          .ofEpochSecond(Math.floorDiv(timestampMicros, 1_000_000), Math.floorMod(timestampMicros + 1, 1_000_000))

Review comment:
       I think the second argument in `Instant.ofEpochSecond` is nanosecond instead of microsecond? Although this doesn't change the outcome of this method since we don't accept nanosecond in the first place, and inputing microsecond for nanosecond field just makes the field smaller than it should be, which is already enough to address the case when `Math.floorMod()` == 0

##########
File path: api/src/test/java/org/apache/iceberg/transforms/TestDatesProjection.java
##########
@@ -142,6 +161,24 @@ public void testMonthStrictUpperBound() {
     assertProjectionStrictValue(spec, in("date", anotherDate, date), Expression.Operation.FALSE);
   }
 
+  @Test
+  public void testNegativeMonthStrictUpperBound() {
+    Integer date = (Integer) Literal.of("1969-12-31").to(TYPE).value();
+    PartitionSpec spec = PartitionSpec.builderFor(SCHEMA).month("date").build();
+
+    assertProjectionStrictValue(spec, lessThan("date", date), Expression.Operation.FALSE);
+    assertProjectionStrictValue(spec, lessThanOrEqual("date", date), Expression.Operation.FALSE);
+    assertProjectionStrict(spec, greaterThan("date", date), Expression.Operation.GT, "1969-12");
+    assertProjectionStrict(spec, greaterThanOrEqual("date", date), Expression.Operation.GT, "1969-12");
+    assertProjectionStrict(spec, notEqual("date", date), Expression.Operation.NOT_IN, "[1969-12, 1970-01]");
+    assertProjectionStrictValue(spec, equal("date", date), Expression.Operation.FALSE);
+
+    Integer anotherDate = (Integer) Literal.of("1970-01-01").to(TYPE).value();
+    assertProjectionStrict(spec, notIn("date", date, anotherDate),
+        Expression.Operation.NOT_IN, "[1969-12, 1970-01]");

Review comment:
       I think this is testing the same thing for `NOT_IN` in `testNegativeMonthStrictLowerBound`, do we want to change 1970-01-01 to an earlier time? 

##########
File path: api/src/main/java/org/apache/iceberg/transforms/ProjectionUtil.java
##########
@@ -254,4 +257,124 @@ private ProjectionUtil() {
     return predicate(predicate.op(), fieldName,
         Iterables.transform(predicate.asSetPredicate().literalSet(), transform::apply));
   }
+
+  static UnboundPredicate<Integer> fixInclusiveTimeProjection(UnboundPredicate<Integer> projected) {
+    if (projected == null) {
+      return projected;
+    }
+
+    // adjust the predicate for values that were 1 larger than the correct transformed value

Review comment:
       I think the mention of "correct" and "incorrect" in these comments might not be obvious to understand outside the context of this PR, as the reader has no prior knowledge of the issue we are fixing here from looking at the code base itself. Do we want to add some more explanation/link to the issue? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] boroknagyz commented on pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
boroknagyz commented on pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#issuecomment-757054106


   Thanks for fixing this, I've just run into this issue with Impala. (And this fix is also a prerequisite for issue #2043)
   
   So the plan is that 0.11+ writers will write the correct transformation values IIUC. Does it also mean that pre-0.11 readers will not be able to read the new files correctly when the data is before the epoch?
   
   E.g. new writer writes '1969-12-31 23:05:00' with YEAR partition transform (so ts_year=-1), old reader scans for 'ts > 1969-12-31 23:00:00' (so with old transform the projected predicate is ts_year >=0), and the file will be skipped because the partition data has ts_year=-1.
   
   I just want to clarify this, because with Impala I'm planning to write V1 files with the fixed transforms.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1981: Fix date and timestamp transforms

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1981:
URL: https://github.com/apache/iceberg/pull/1981#discussion_r562304565



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Dates.java
##########
@@ -99,11 +107,24 @@ public boolean satisfiesOrderOf(Transform<?, ?> other) {
 
     if (pred.isUnaryPredicate()) {
       return Expressions.predicate(pred.op(), fieldName);
+
     } else if (pred.isLiteralPredicate()) {
-      return ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      UnboundPredicate<Integer> projected = ProjectionUtil.truncateInteger(fieldName, pred.asLiteralPredicate(), this);
+      if (this != DAY) {

Review comment:
       There are only 4 instance of this, and they call 2 different fix methods. I don't think it would be worth adding 2 methods just to dedup 4 lines here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org