You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/28 21:32:39 UTC

[GitHub] [iceberg] rdblue opened a new pull request #1675: API: Extend partitioning and sort order

rdblue opened a new pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675


   This updates the partitioning and sort order API with utilities needed for required distribution and ordering.
   
   * Adds `Expressions.transform` to create an `UnboundTransform` term from an existing `Transform` and a field name.
   * Adds `SortField.satisfies` and `Transform.satisfiesOrderOf` to implement satisfies between compatible fields, like days and hours. Hours satisfies the ordering of days.
   * Updates `PartitionSpecVisitor` to pass partition field IDs and adds `unknown` and `alwaysNull` visitor methods.
   * Adds `SortOrderVisitor` to visit a sort order
   * Adds `Partitioning` helper methods `hasBucketField(PartitionSpec)` and `sortOrderFor(PartitionSpec)`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517041062



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       Oh, sorry about that. The bucket column is tracked because it is assumed to be a high cardinality column. Since we want to end a sort order with a high cardinality column, we keep track of the bucket column with the largest number of buckets and add that to the inferred sort order.
   
   ```java
         // the column with highest cardinality is usually the one with the highest number of buckets
         if (numBuckets > highestNumBuckets) {
           this.highestNumBuckets = numBuckets;
           this.bucketColumn = sourceName;
         }
   ```
   
   Then above in `sortOrderFor(PartitionSpec)`, if the bucket column is present, it is added to the order.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517018705



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Transform.java
##########
@@ -58,6 +58,29 @@
    */
   Type getResultType(Type sourceType);
 
+  /**
+   * Whether the transform preserves the order of values (is monotonic).
+   * <p>
+   * A transform preserves order for values when for any given a and b, if a &lt; b then apply(a) &lt;= apply(b).

Review comment:
       Wouldn't it also be the case that a = b; then apply(a) = apply(b)? 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517042523



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       Ah, ok.  That makes sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517563205



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Transform.java
##########
@@ -58,6 +58,29 @@
    */
   Type getResultType(Type sourceType);
 
+  /**
+   * Whether the transform preserves the order of values (is monotonic).
+   * <p>
+   * A transform preserves order for values when for any given a and b, if a &lt; b then apply(a) &lt;= apply(b).
+   *
+   * @return true if the transform preserves the order of values
+   */
+  default boolean preservesOrder() {
+    return false;
+  }
+
+  /**
+   * Whether ordering by this transform's result satisfies the ordering of another transform's result.
+   * <p>
+   * For example, sorting by day(ts) will produce an ordering that is also by month(ts) or year(ts). However, sorting
+   * by day(ts) will not satisfy the order of hour(ts) or identity(ts).
+   *
+   * @return true if ordering by this transform is equivalent to ordering by the other transform

Review comment:
       Makes sense, it's the orderings that are equivalent. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#issuecomment-721958329


   +1 LGTM


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517017628



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Transform.java
##########
@@ -58,6 +58,29 @@
    */
   Type getResultType(Type sourceType);
 
+  /**
+   * Whether the transform preserves the order of values (is monotonic).
+   * <p>
+   * A transform preserves order for values when for any given a and b, if a &lt; b then apply(a) &lt;= apply(b).
+   *
+   * @return true if the transform preserves the order of values
+   */
+  default boolean preservesOrder() {
+    return false;
+  }
+
+  /**
+   * Whether ordering by this transform's result satisfies the ordering of another transform's result.
+   * <p>
+   * For example, sorting by day(ts) will produce an ordering that is also by month(ts) or year(ts). However, sorting
+   * by day(ts) will not satisfy the order of hour(ts) or identity(ts).
+   *
+   * @return true if ordering by this transform is equivalent to ordering by the other transform

Review comment:
       I think "equivalent to" is more specific in this case because "satisfies" is in the method name. I'm trying to avoid using the term in the definition, and the meaning of "satisfies" is that ordering by one is equivalent to ordering by the other.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517027785



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);

Review comment:
       I feel like should be able to really reduce the amount of boilerplate if/else for asc/desc by exposing sort direction directly from the builder.  Seems like you could just add a Builder::addSort(term, direction, nullOrder) and make this much cleaner.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517525755



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);

Review comment:
       Added `sortBy` to the builder for this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517019407



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);
+    } else {
+      builder.desc(sourceName, nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void bucket(String sourceName, int sourceId, int numBuckets, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    } else {
+      builder.desc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void truncate(String sourceName, int sourceId, int width, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.truncate(sourceName, width), nullOrder);
+    } else {
+      builder.desc(Expressions.truncate(sourceName, width), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void year(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.year(sourceName), nullOrder);

Review comment:
       No, because the expressions here are all unbound, so there is no guarantee that the types match or that the expression can be evaluated. That is guaranteed after an expression is bound.
   
   For example, I can create `Expressions.lessThan("x", 34)`. The type of `x` in a table is unknown and could be `long`. When the expression is bound, the type of `x` is taken from the schema the expression is bound with, and the value is coerced to that type.
   
   Transforms are also fixed up when an expression is bound. Although the code initially assumes that the column identified by `sourceName` is a timestamp, binding will create the right transform based on the actual column type.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#issuecomment-721993672


   Thanks for reviewing, everyone!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517031523



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       Are we guaranteed to have only one bucket column?  I actually didn't think of this until now, but Hive allows for bucketing on multiple columns (though those are combined and hashed I believe.  However, I didn't immedaitely see anything that prevents someone from defining multiple bucket columns.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#issuecomment-718230783


   @jun-he, this is the set of changes we talked about that are useful for the `UpdatePartitionSpec` operation. This adds `Expressions.transform` that allows you to create an `UnboundTransform` term from a `Transform` instance. Then the `SortOrder` builder accepts those (as `Term` arguments) to build an order. We could do the same thing to keep state as a list of partition spec fields and pass them back to the builder.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jun-he commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
jun-he commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517474179



##########
File path: api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java
##########
@@ -26,20 +26,94 @@
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 
 public interface PartitionSpecVisitor<T> {
-  T identity(String sourceName, int sourceId);
+  default T identity(int fieldId, String sourceName, int sourceId) {
+    return identity(sourceName, sourceId);
+  }
+
+  default T identity(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Identity transform is not supported");
+  }
+
+  default T bucket(int fieldId, String sourceName, int sourceId, int width) {
+    return bucket(sourceName, sourceId, width);
+  }
+
+  default T bucket(String sourceName, int sourceId, int width) {
+    throw new UnsupportedOperationException("Bucket transform is not supported");
+  }
 
-  T bucket(String sourceName, int sourceId, int width);
+  default T truncate(int fieldId, String sourceName, int sourceId, int width) {
+    return truncate(sourceName, sourceId, width);
+  }
 
-  T truncate(String sourceName, int sourceId, int width);
+  default T truncate(String sourceName, int sourceId, int width) {
+    throw new UnsupportedOperationException("Truncate transform is not supported");
+  }
 
-  T year(String sourceName, int sourceId);
+  default T year(int fieldId, String sourceName, int sourceId) {
+    return year(sourceName, sourceId);
+  }
 
-  T month(String sourceName, int sourceId);
+  default T year(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Year transform is not supported");
+  }
 
-  T day(String sourceName, int sourceId);
+  default T month(int fieldId, String sourceName, int sourceId) {
+    return month(sourceName, sourceId);
+  }
+
+  default T month(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Month transform is not supported");
+  }
 
-  T hour(String sourceName, int sourceId);
+  default T day(int fieldId, String sourceName, int sourceId) {
+    return day(sourceName, sourceId);
+  }
+
+  default T day(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Day transform is not supported");
+  }
+
+  default T hour(int fieldId, String sourceName, int sourceId) {
+    return hour(sourceName, sourceId);
+  }
+
+  default T hour(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Hour transform is not supported");
+  }
+
+  default T alwaysNull(int fieldId, String sourceName, int sourceId) {

Review comment:
       nit: should we add `default T alwaysNull(String sourceName, int sourceId)` as well?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517526408



##########
File path: api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java
##########
@@ -26,20 +26,94 @@
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 
 public interface PartitionSpecVisitor<T> {
-  T identity(String sourceName, int sourceId);
+  default T identity(int fieldId, String sourceName, int sourceId) {
+    return identity(sourceName, sourceId);
+  }
+
+  default T identity(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Identity transform is not supported");
+  }
+
+  default T bucket(int fieldId, String sourceName, int sourceId, int width) {

Review comment:
       Fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517489486



##########
File path: api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java
##########
@@ -26,20 +26,94 @@
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 
 public interface PartitionSpecVisitor<T> {
-  T identity(String sourceName, int sourceId);
+  default T identity(int fieldId, String sourceName, int sourceId) {
+    return identity(sourceName, sourceId);
+  }
+
+  default T identity(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Identity transform is not supported");
+  }
+
+  default T bucket(int fieldId, String sourceName, int sourceId, int width) {
+    return bucket(sourceName, sourceId, width);
+  }
+
+  default T bucket(String sourceName, int sourceId, int width) {
+    throw new UnsupportedOperationException("Bucket transform is not supported");
+  }
 
-  T bucket(String sourceName, int sourceId, int width);
+  default T truncate(int fieldId, String sourceName, int sourceId, int width) {
+    return truncate(sourceName, sourceId, width);
+  }
 
-  T truncate(String sourceName, int sourceId, int width);
+  default T truncate(String sourceName, int sourceId, int width) {
+    throw new UnsupportedOperationException("Truncate transform is not supported");
+  }
 
-  T year(String sourceName, int sourceId);
+  default T year(int fieldId, String sourceName, int sourceId) {
+    return year(sourceName, sourceId);
+  }
 
-  T month(String sourceName, int sourceId);
+  default T year(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Year transform is not supported");
+  }
 
-  T day(String sourceName, int sourceId);
+  default T month(int fieldId, String sourceName, int sourceId) {
+    return month(sourceName, sourceId);
+  }
+
+  default T month(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Month transform is not supported");
+  }
 
-  T hour(String sourceName, int sourceId);
+  default T day(int fieldId, String sourceName, int sourceId) {
+    return day(sourceName, sourceId);
+  }
+
+  default T day(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Day transform is not supported");
+  }
+
+  default T hour(int fieldId, String sourceName, int sourceId) {
+    return hour(sourceName, sourceId);
+  }
+
+  default T hour(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Hour transform is not supported");
+  }
+
+  default T alwaysNull(int fieldId, String sourceName, int sourceId) {

Review comment:
       No, the methods without fieldId are only there for backward compatibility.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517037040



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       There's no reason why we can't have more than one bucket transform, but Iceberg doesn't mix fields together in bucket transforms. The reason is that each bucket column should be useful independently.
   
   For example, you might have `PARTITION BY (bucket(8, user_id), bucket(16, item_id))`. Joining to both users and items independently could take advantage of those bucketed columns. And joining two tables with `user_id` and `item_id` (like `purchases` and `reviews`) could use a combination of both: `user_bucket << 16 | item_bucket`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517039050



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       I guess I wasn't clear on my point.  I understand why we're not doing what Hive does in this case, but rather how we are only using a single bucket in the sort order above on line 106. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517035561



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);

Review comment:
       Yeah, we should probably do that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517035417



##########
File path: api/src/main/java/org/apache/iceberg/transforms/Transform.java
##########
@@ -58,6 +58,29 @@
    */
   Type getResultType(Type sourceType);
 
+  /**
+   * Whether the transform preserves the order of values (is monotonic).
+   * <p>
+   * A transform preserves order for values when for any given a and b, if a &lt; b then apply(a) &lt;= apply(b).

Review comment:
       If a = b, then apply(a) should always equal apply(b). This is concerned with the case where a and b are not equal.
   
   For example, a year transform has this property. year(2020-12-31) = 2020 and year(2021-01-01) = 2021. Because the first date comes before the second, its year must be before (or the same as) any date after it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517002670



##########
File path: api/src/main/java/org/apache/iceberg/transforms/PartitionSpecVisitor.java
##########
@@ -26,20 +26,94 @@
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 
 public interface PartitionSpecVisitor<T> {
-  T identity(String sourceName, int sourceId);
+  default T identity(int fieldId, String sourceName, int sourceId) {
+    return identity(sourceName, sourceId);
+  }
+
+  default T identity(String sourceName, int sourceId) {
+    throw new UnsupportedOperationException("Identity transform is not supported");
+  }
+
+  default T bucket(int fieldId, String sourceName, int sourceId, int width) {

Review comment:
       `width` -> `numBuckets`? Same as in `SortOrderVisitor`L34

##########
File path: api/src/main/java/org/apache/iceberg/transforms/Transform.java
##########
@@ -58,6 +58,29 @@
    */
   Type getResultType(Type sourceType);
 
+  /**
+   * Whether the transform preserves the order of values (is monotonic).
+   * <p>
+   * A transform preserves order for values when for any given a and b, if a &lt; b then apply(a) &lt;= apply(b).
+   *
+   * @return true if the transform preserves the order of values
+   */
+  default boolean preservesOrder() {
+    return false;
+  }
+
+  /**
+   * Whether ordering by this transform's result satisfies the ordering of another transform's result.
+   * <p>
+   * For example, sorting by day(ts) will produce an ordering that is also by month(ts) or year(ts). However, sorting
+   * by day(ts) will not satisfy the order of hour(ts) or identity(ts).
+   *
+   * @return true if ordering by this transform is equivalent to ordering by the other transform

Review comment:
       "equivalent to" -> I thought it would be "satisfies"?

##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);
+    } else {
+      builder.desc(sourceName, nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void bucket(String sourceName, int sourceId, int numBuckets, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    } else {
+      builder.desc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void truncate(String sourceName, int sourceId, int width, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.truncate(sourceName, width), nullOrder);
+    } else {
+      builder.desc(Expressions.truncate(sourceName, width), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void year(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.year(sourceName), nullOrder);

Review comment:
       I'm not familiar with this; I think in `SortOrderVisitor` L75 both `Dates.YEAR` and `Timestamps.YEAR` will end up in here, but `Expressions.year` will only produce `Timestamp` transform (from [here](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/Expressions.java#L88)). Would this cause compatibility issue?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] danielcweeks commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
danielcweeks commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517031523



##########
File path: core/src/main/java/org/apache/iceberg/Partitioning.java
##########
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import java.util.List;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.PartitionSpecVisitor;
+
+public class Partitioning {
+  private Partitioning() {
+  }
+
+  /**
+   * Check whether the spec contains a bucketed partition field.
+   *
+   * @param spec a partition spec
+   * @return true if the spec has field with a bucket transform
+   */
+  public static boolean hasBucketField(PartitionSpec spec) {
+    List<Boolean> bucketList = PartitionSpecVisitor.visit(spec, new PartitionSpecVisitor<Boolean>() {
+      @Override
+      public Boolean identity(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean bucket(int fieldId, String sourceName, int sourceId, int width) {
+        return true;
+      }
+
+      @Override
+      public Boolean truncate(int fieldId, String sourceName, int sourceId, int width) {
+        return false;
+      }
+
+      @Override
+      public Boolean year(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean month(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean day(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean hour(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean alwaysNull(int fieldId, String sourceName, int sourceId) {
+        return false;
+      }
+
+      @Override
+      public Boolean unknown(int fieldId, String sourceName, int sourceId, String transform) {
+        return false;
+      }
+    });
+
+    return bucketList.stream().anyMatch(Boolean::booleanValue);
+  }
+
+  /**
+   * Create a sort order that will group data for a partition spec.
+   * <p>
+   * If the partition spec contains bucket columns, the sort order will also have a field to sort by a column that is
+   * bucketed in the spec. The column is selected by the highest number of buckets in the transform.
+   *
+   * @param spec a partition spec
+   * @return a sort order that will cluster data for the spec
+   */
+  public static SortOrder sortOrderFor(PartitionSpec spec) {
+    if (spec.isUnpartitioned()) {
+      return SortOrder.unsorted();
+    }
+
+    SortOrder.Builder builder = SortOrder.builderFor(spec.schema());
+    SpecToOrderVisitor converter = new SpecToOrderVisitor(builder);
+    PartitionSpecVisitor.visit(spec, converter);
+
+    // columns used for bucketing are high cardinality; add one to the sort at the end
+    String bucketColumn = converter.bucketColumn();
+    if (bucketColumn != null) {
+      builder.asc(bucketColumn);
+    }
+
+    return builder.build();
+  }
+
+  private static class SpecToOrderVisitor implements PartitionSpecVisitor<Void> {
+    private final SortOrder.Builder builder;
+    private String bucketColumn = null;

Review comment:
       Are we guaranteed to have only one bucket column?  I actually didn't think of this until now, but Hive allows for bucketing on multiple columns (though those are combined and hashed I believe).  However, I didn't immedaitely see anything that prevents someone from defining multiple bucket columns.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] yyanyy commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
yyanyy commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517562317



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);
+    } else {
+      builder.desc(sourceName, nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void bucket(String sourceName, int sourceId, int numBuckets, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    } else {
+      builder.desc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void truncate(String sourceName, int sourceId, int width, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.truncate(sourceName, width), nullOrder);
+    } else {
+      builder.desc(Expressions.truncate(sourceName, width), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void year(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.year(sourceName), nullOrder);

Review comment:
       Ah I see, I overlooked [`Timestamps/Dates.valueOf()`](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/transforms/Transforms.java#L62-L64) in `fromString`. Thank you for the explanation!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #1675: API: Extend partitioning and sort order

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #1675:
URL: https://github.com/apache/iceberg/pull/1675#discussion_r517019407



##########
File path: core/src/main/java/org/apache/iceberg/CopySortOrderFields.java
##########
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.transforms.SortOrderVisitor;
+
+class CopySortOrderFields implements SortOrderVisitor<Void> {
+  private final SortOrder.Builder builder;
+
+  CopySortOrderFields(SortOrder.Builder builder) {
+    this.builder = builder;
+  }
+
+  @Override
+  public Void field(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(sourceName, nullOrder);
+    } else {
+      builder.desc(sourceName, nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void bucket(String sourceName, int sourceId, int numBuckets, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    } else {
+      builder.desc(Expressions.bucket(sourceName, numBuckets), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void truncate(String sourceName, int sourceId, int width, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.truncate(sourceName, width), nullOrder);
+    } else {
+      builder.desc(Expressions.truncate(sourceName, width), nullOrder);
+    }
+    return null;
+  }
+
+  @Override
+  public Void year(String sourceName, int sourceId, SortDirection direction, NullOrder nullOrder) {
+    if (direction == SortDirection.ASC) {
+      builder.asc(Expressions.year(sourceName), nullOrder);

Review comment:
       No, because the expressions here are all unbound, so there is no guarantee that the types match or that the expression can be evaluated. That is guaranteed after an expression is bound.
   
   For example, I can create `Expressions.lessThan("x", 34)`. The type of `x` in a table is unknown and could be `long`. When the expression is bound, the type of `x` is taken from the schema the expression is bound with, and the value is coerced to that type.
   
   Transforms are also fixed up when an expression is bound. Although the code initially assumes that the column identified by `sourceName` is a timestamp, binding will [create the right transform](https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/expressions/UnboundTransform.java#L52-L53) based on the actual column type.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org