You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/19 00:13:32 UTC

[GitHub] [iceberg] RussellSpitzer opened a new pull request #2609: Adds SortRewriteStrategy

RussellSpitzer opened a new pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609


   A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
   in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
   by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
   this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
   File C' (x: 41 - 60).
   
   Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
   is true (default). If this property is disabled any files with the incorrect sort-order as well as any files
   that would be chosen by {@link BinPackStrategy} will be rewrite candidates.
   
   In the future other algorithms for determining files to rewrite will be provided.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669890467



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files that would be chosen by
+ * {@link BinPackStrategy} will be rewrite candidates.

Review comment:
       `sort_order_id` is an optional field in data file, in the current process sort order is always null. I think we can distinguish that case from a wrong sort order which mean non-null and not match the given order id.
   
   Once in v2, it will also be beneficial to rewrite data files with null sort order id to backfill the order for table written by v1 writer. It seems like we can add boolean options for something like `rewrite-missing-order` and `rewrite-wrong-order`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SreeramGarlapati commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
SreeramGarlapati commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r634839552



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Optional;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),

Review comment:
       nit: In this example - all the files are rewritten as well. Did you intend to provide an ex. to demonstrate that - this algo. will avoid file rewrites for already ordered files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669892856



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files that would be chosen by
+ * {@link BinPackStrategy} will be rewrite candidates.

Review comment:
       That's fine, I just think it's ok the hold off on options for code we don't have the ability to use yet




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669915166



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       Just trying to rethink about clustering here. In Hive, people need to consider clustering because bucketing is not a part of partitioning strategy and you always partition by column value. But in Iceberg, when people do something like partition by `category, bucket(16, id)` in partition strategy, it is already equivalent to Hive's 
   
   ```
   PARTITION BY (category)
   CLUSTER BY (id) INTO 16 BUCKETS
   ```
   
   In this sense, does that mean we actually don't really need clustering detection in Iceberg's rewrite strategy? Or did I miss any other use cases when you say clustering?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SreeramGarlapati commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
SreeramGarlapati commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r634839552



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Optional;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),

Review comment:
       nit: In this example - all the files are rewritten as well. Did you intend to provide an ex. to demonstrate that - this algo. will avoid file rewrites for already ordered files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669917070



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       This clustering is within a partition, perhaps the naming is not great here but I"m talking about whether we have several datafiles which cover the same region of the sort column. Like hopefully in the future we can detect files A, B, C have significant overlap so we should rewrite those 3 files.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dixingxing0 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
dixingxing0 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r634878319



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Optional;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default). If this property is disabled any files with the incorrect sort-order as well as any files
+ * that would be chosen by {@link BinPackStrategy} will be rewrite candidates.
+ * <p>
+ * In the future other algorithms for determining files to rewrite will be provided.
+ */
+abstract class SortStrategy extends BinPackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(SortStrategy.class);
+
+  /**
+   * Rewrites all files, regardless of their size. Defaults to false, rewriting only wrong sort-order and mis-sized
+   * files;
+   */
+  public static final String REWRITE_ALL = "no-size-filter";
+  public static final boolean REWRITE_ALL_DEFAULT = false;

Review comment:
       Curious about this, `REWRITE_ALL_DEFAULT = false`, but  java doc says : 
   ```
    Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL} * is true (default).
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669915166



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       Just trying to rethink about clustering here. In Hive, people need to consider clustering because bucketing is not a part of partitioning strategy and you always partition by column value. But in Iceberg, when people do something like partition by `category, bucket(16, id)` in partition strategy, it is already equivalent to Hive's 
   
   ```
   PARTITIONED BY (category)
   CLUSTERED BY (id) INTO 16 BUCKETS
   ```
   
   In this sense, does that mean we actually don't really need clustering detection in Iceberg's rewrite strategy? Or did I miss any other use cases when you say clustering?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669883244



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files that would be chosen by
+ * {@link BinPackStrategy} will be rewrite candidates.

Review comment:
       Currently we don't ever actually record the sort order that a file is written with, so that means it will never match atm.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669912287



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files that would be chosen by
+ * {@link BinPackStrategy} will be rewrite candidates.

Review comment:
       Cool, I am good with having it later




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#issuecomment-879324993


   @aokolnychyi  Just updated this, I think this is the next step towards getting our Sort Implementation in as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer merged pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer merged pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#issuecomment-843649514


   @jackye1995 + @openinx - Sort Based PR (Abstract Strategy Only)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669821559



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       I was thinking (2)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669928917



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       I see, that makes much more sense, we should probably make it clear in the documentation. Apart from that I am good with this!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669888308



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestSortStrategy.java
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.IntStream;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableTestBase;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestSortStrategy extends TableTestBase {
+
+  @Parameterized.Parameters(name = "formatVersion = {0}")
+  public static Object[] parameters() {
+    return new Object[] {2}; // We don't actually use the format version since everything is mock
+  }
+
+  @Override
+  public void setupTable() throws Exception {
+    super.setupTable();
+    table.replaceSortOrder().asc("data").commit();
+  }
+
+  private static final long MB = 1024 * 1024;
+
+  public TestSortStrategy(int formatVersion) {
+    super(formatVersion);
+  }
+
+  class TestSortStrategyImpl extends SortStrategy {
+
+    @Override
+    public Table table() {
+      return table;
+    }
+
+    @Override
+    public Set<DataFile> rewriteFiles(List<FileScanTask> filesToRewrite) {
+      throw new UnsupportedOperationException();
+    }
+  }
+
+  private SortStrategy defaultSort() {
+    return (SortStrategy) new TestSortStrategyImpl().options(Collections.emptyMap());
+  }
+
+  private List<FileScanTask> tasksForSortOrder(int sortOrderId, int... fileSizesMB) {
+    ImmutableList.Builder<FileScanTask> files = ImmutableList.builder();
+    IntStream.of(fileSizesMB).forEach(length -> files.add(MockFileScanTask.mockTask(length * MB, sortOrderId)));
+    return files.build();
+  }
+
+  @Test
+  public void testInvalidSortOrder() {
+    AssertHelpers.assertThrows("Should not allow an unsorted Sort order", IllegalArgumentException.class,
+        () -> defaultSort().sortOrder(SortOrder.unsorted()).options(Collections.emptyMap()));
+
+    AssertHelpers.assertThrows("Should not allow a Sort order with bad columns", ValidationException.class,
+        () -> {
+          Schema badSchema = new Schema(
+              ImmutableList.of(Types.NestedField.required(0, "nonexistant", Types.IntegerType.get())));
+
+          defaultSort()
+              .sortOrder(SortOrder.builderFor(badSchema).asc("nonexistant").build())
+              .options(Collections.emptyMap());
+        });
+  }
+
+  @Test
+  public void testSelectAll() {
+    List<FileScanTask> invalid = ImmutableList.<FileScanTask>builder()
+        .addAll(tasksForSortOrder(-1, 500, 500, 500, 500))

Review comment:
       Currently it doesn't matter because of the note I wrote below, since we normally do not set this value on our writers




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#issuecomment-884203945


   Thanks everyone for the review! Please check out the implementation PR as well #2829 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] SreeramGarlapati commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
SreeramGarlapati commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r634843808



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestSortStrategy.java
##########
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+
+import java.util.Collections;
+import java.util.List;
+import java.util.stream.IntStream;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableTestBase;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestSortStrategy extends TableTestBase {
+
+  @Parameterized.Parameters(name = "formatVersion = {0}")
+  public static Object[] parameters() {
+    return new Object[] {2}; // We don't actually use the format version since everything is mock
+  }
+
+  @Override
+  public void setupTable() throws Exception {
+    super.setupTable();
+    table.replaceSortOrder().asc("data").commit();
+  }
+
+  private static final long MB = 1024 * 1024;
+
+  public TestSortStrategy(int formatVersion) {
+    super(formatVersion);
+  }
+
+  class TestSortStrategyImpl extends SortStrategy {
+
+    @Override
+    public Table table() {
+      return table;
+    }
+
+    @Override
+    public List<DataFile> rewriteFiles(List<FileScanTask> filesToRewrite) {
+      throw new UnsupportedOperationException();
+    }
+  }
+
+  private SortStrategy defaultSort() {
+    return (SortStrategy) new TestSortStrategyImpl().options(Collections.emptyMap());
+  }
+
+  private List<FileScanTask> tasksForSortOrder(int sortOrderId, int... fileSizesMB) {
+    ImmutableList.Builder<FileScanTask> files = ImmutableList.builder();
+    IntStream.of(fileSizesMB).forEach(length -> files.add(MockFileScanTask.mockTask(length * MB, sortOrderId)));
+    return files.build();
+  }
+
+  @Test
+  public void testInvalidSortOrder() {
+    AssertHelpers.assertThrows("Should not allow an unsorted Sort order", IllegalArgumentException.class,
+        () -> defaultSort().sortOrder(SortOrder.unsorted()).options(Collections.emptyMap()));
+
+    AssertHelpers.assertThrows("Should not allow a Sort order with bad columns", ValidationException.class,
+        () -> {
+          Schema badSchema = new Schema(
+              ImmutableList.of(Types.NestedField.required(0, "nonexistant", Types.IntegerType.get())));
+
+          defaultSort()
+              .sortOrder(SortOrder.builderFor(badSchema).asc("nonexistant").build())
+              .options(Collections.emptyMap());
+        });
+  }
+
+  @Test
+  public void testSelectWrongSortOrder() {
+    List<FileScanTask> expected = tasksForSortOrder(-1, 500, 500, 500, 500);
+    RewriteStrategy strategy = defaultSort().options(Collections.emptyMap());
+    List<FileScanTask> actual = ImmutableList.copyOf(strategy.selectFilesToRewrite(expected));
+
+    Assert.assertEquals("Should mark all files for rewrite if they have the wrong sort order",
+        expected, actual);
+  }
+
+  @Test
+  public void testSelectCorrectSortOrder() {
+    List<FileScanTask> fileScanTasks = tasksForSortOrder(table.sortOrder().orderId(), 500, 500, 500, 500);
+    RewriteStrategy strategy = defaultSort().options(Collections.emptyMap());
+    List<FileScanTask> actual = ImmutableList.copyOf(strategy.selectFilesToRewrite(fileScanTasks));
+
+    Assert.assertEquals("Should mark no files for rewrite if they have a good size and the right sort order",
+        Collections.emptyList(), actual);
+  }
+
+  @Test
+  public void testSelectMixedOrderMixedSize() {
+    List<FileScanTask> expected = ImmutableList.<FileScanTask>builder()
+        .addAll(tasksForSortOrder(-1, 500, 500, 500, 500))
+        .addAll(tasksForSortOrder(table.sortOrder().orderId(), 10, 10, 2000, 10))
+        .build();
+
+    List<FileScanTask> fileScanTasks = ImmutableList.<FileScanTask>builder()
+        .addAll(expected)
+        .addAll(tasksForSortOrder(table.sortOrder().orderId(), 500, 490, 520))
+        .build();
+
+    RewriteStrategy strategy = defaultSort().options(Collections.emptyMap());
+    List<FileScanTask> actual = ImmutableList.copyOf(strategy.selectFilesToRewrite(fileScanTasks));
+
+    Assert.assertEquals("Should mark files for rewrite with invalid size and bad sort order",

Review comment:
       nit: `Should mark files for rewrite with invalid sort order id and mixed sizes`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r670774148



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       Fixed, switched to "file-overlap" instead of "clustering"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669820275



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}

Review comment:
       Have we thought about how we will support clustering in the future? Is it going to be:
   1. another strategy that extends this `SortStrategy`
   2. additional configurations in this strategy
   3. or something else?
   
   I am just wondering this approach for extending `BinPackStrategy` is extensible in the long run as different strategies might have inter-dependencies with each other. If this is just one config or maybe plus a few additional configs for clustering in the future, would it be better to just directly make them configurations in the `BinPackStrategy`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer edited a comment on pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#issuecomment-883428324


   Ok, if we don't have any other points I think we can get this in so I can rebase the implementation and sort order PR's. I'll wait a while to see if anyone has any last opinions and merge tonight.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669879435



##########
File path: core/src/test/java/org/apache/iceberg/actions/TestSortStrategy.java
##########
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+
+import java.util.Collections;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.IntStream;
+import org.apache.iceberg.AssertHelpers;
+import org.apache.iceberg.DataFile;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.MockFileScanTask;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.TableTestBase;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Types;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+@RunWith(Parameterized.class)
+public class TestSortStrategy extends TableTestBase {
+
+  @Parameterized.Parameters(name = "formatVersion = {0}")
+  public static Object[] parameters() {
+    return new Object[] {2}; // We don't actually use the format version since everything is mock
+  }
+
+  @Override
+  public void setupTable() throws Exception {
+    super.setupTable();
+    table.replaceSortOrder().asc("data").commit();
+  }
+
+  private static final long MB = 1024 * 1024;
+
+  public TestSortStrategy(int formatVersion) {
+    super(formatVersion);
+  }
+
+  class TestSortStrategyImpl extends SortStrategy {
+
+    @Override
+    public Table table() {
+      return table;
+    }
+
+    @Override
+    public Set<DataFile> rewriteFiles(List<FileScanTask> filesToRewrite) {
+      throw new UnsupportedOperationException();
+    }
+  }
+
+  private SortStrategy defaultSort() {
+    return (SortStrategy) new TestSortStrategyImpl().options(Collections.emptyMap());
+  }
+
+  private List<FileScanTask> tasksForSortOrder(int sortOrderId, int... fileSizesMB) {
+    ImmutableList.Builder<FileScanTask> files = ImmutableList.builder();
+    IntStream.of(fileSizesMB).forEach(length -> files.add(MockFileScanTask.mockTask(length * MB, sortOrderId)));
+    return files.build();
+  }
+
+  @Test
+  public void testInvalidSortOrder() {
+    AssertHelpers.assertThrows("Should not allow an unsorted Sort order", IllegalArgumentException.class,
+        () -> defaultSort().sortOrder(SortOrder.unsorted()).options(Collections.emptyMap()));
+
+    AssertHelpers.assertThrows("Should not allow a Sort order with bad columns", ValidationException.class,
+        () -> {
+          Schema badSchema = new Schema(
+              ImmutableList.of(Types.NestedField.required(0, "nonexistant", Types.IntegerType.get())));
+
+          defaultSort()
+              .sortOrder(SortOrder.builderFor(badSchema).asc("nonexistant").build())
+              .options(Collections.emptyMap());
+        });
+  }
+
+  @Test
+  public void testSelectAll() {
+    List<FileScanTask> invalid = ImmutableList.<FileScanTask>builder()
+        .addAll(tasksForSortOrder(-1, 500, 500, 500, 500))

Review comment:
       what is order id -1? If it's default UNSORTED_ORDER, it should be 0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#issuecomment-883428324


   Ok, if we don't have any other points I think we can get this in. So I can rebase the implementation and sort order PR's. I'll wait a while to see if anyone has any last opinions and merge tonight.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] jackye1995 commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
jackye1995 commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r669881103



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files that would be chosen by
+ * {@link BinPackStrategy} will be rewrite candidates.

Review comment:
       I thought it would be any files that would be chosen by BinPackStrategy + any file with the wrong sort oder. Why is part 2 excluded?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] aokolnychyi commented on a change in pull request #2609: Adds SortRewriteStrategy

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on a change in pull request #2609:
URL: https://github.com/apache/iceberg/pull/2609#discussion_r635589297



##########
File path: core/src/main/java/org/apache/iceberg/actions/SortStrategy.java
##########
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.actions;
+
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Optional;
+import java.util.Set;
+import org.apache.iceberg.FileScanTask;
+import org.apache.iceberg.SortOrder;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.FluentIterable;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet;
+import org.apache.iceberg.util.PropertyUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * A rewrite strategy for data files which aims to reorder data with data files to optimally lay them out
+ * in relation to a column. For example, if the Sort strategy is used on a set of files which is ordered
+ * by column x and original has files File A (x: 0 - 50), File B ( x: 10 - 40) and File C ( x: 30 - 60),
+ * this Strategy will attempt to rewrite those files into File A' (x: 0-20), File B' (x: 21 - 40),
+ * File C' (x: 41 - 60).
+ * <p>
+ * Currently the there is no clustering detection and we will rewrite all files if {@link SortStrategy#REWRITE_ALL}
+ * is true (default: false). If this property is disabled any files with the incorrect sort-order as well as any files
+ * that would be chosen by {@link BinPackStrategy} will be rewrite candidates.
+ * <p>
+ * In the future other algorithms for determining files to rewrite will be provided.
+ */
+public abstract class SortStrategy extends BinPackStrategy {
+  private static final Logger LOG = LoggerFactory.getLogger(SortStrategy.class);
+
+  /**
+   * Rewrites all files, regardless of their size. Defaults to false, rewriting only wrong sort-order and mis-sized
+   * files;
+   */
+  public static final String REWRITE_ALL = "no-size-filter";
+  public static final boolean REWRITE_ALL_DEFAULT = false;
+
+  private static final Set<String> validOptions = ImmutableSet.of(
+      REWRITE_ALL
+  );
+
+  private boolean rewriteAll;
+  private SortOrder sortOrder;
+  private int sortOrderId = -1;
+
+  /**
+   * Sets the sort order to be used in this strategy when rewriting files
+   * @param order the order to use
+   * @return this for method chaining
+   */
+  public SortStrategy sortOrder(SortOrder order) {
+    this.sortOrder = order;
+
+    // See if this order matches any of our known orders
+    Optional<Entry<Integer, SortOrder>> knownOrder = table().sortOrders().entrySet().stream()
+        .filter(entry -> entry.getValue().sameOrder(order))
+        .findFirst();
+    knownOrder.ifPresent(entry -> sortOrderId = entry.getKey());
+
+    return this;
+  }
+
+  @Override
+  public String name() {
+    return "SORT";
+  }
+
+  @Override
+  public Set<String> validOptions() {
+    return ImmutableSet.<String>builder()
+        .addAll(super.validOptions())
+        .addAll(validOptions)
+        .build();
+  }
+
+  @Override
+  public RewriteStrategy options(Map<String, String> options) {
+    super.options(options); // Also checks validity of BinPack options
+
+    rewriteAll = PropertyUtil.propertyAsBoolean(options,
+        REWRITE_ALL,
+        REWRITE_ALL_DEFAULT);
+
+    if (sortOrder == null) {
+      sortOrder = table().sortOrder();
+      sortOrderId = sortOrder.orderId();
+    }
+
+    validateOptions();
+    return this;
+  }
+
+  @Override
+  public Iterable<FileScanTask> selectFilesToRewrite(Iterable<FileScanTask> dataFiles) {
+    if (rewriteAll) {
+      LOG.info("Sort Strategy for table {} set to rewrite all data files", table().name());
+      return dataFiles;
+    } else {
+      FluentIterable filesWithCorrectOrder =

Review comment:
       I think the current implementation as it stands today will always rewrite everything as we never currently propagate the sort order id to data files. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org