You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/09/22 07:25:34 UTC

[GitHub] [iceberg] linfey90 opened a new pull request, #5824: Spark: support hilbert curve when rewrite

linfey90 opened a new pull request, #5824:
URL: https://github.com/apache/iceberg/pull/5824

   The data aggregation of Hilbert curve is better, this pr will support.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022369901


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/HilbertCurveUtils.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import java.math.BigInteger;
+import org.davidmoten.hilbert.HilbertCurve;
+
+/** Utils for Hilbert Curve. */
+public class HilbertCurveUtils {
+  public static byte[] indexBytes(HilbertCurve hilbertCurve, long[] points, int paddingNum) {
+    BigInteger index = hilbertCurve.index(points);
+    return paddingToNByte(index.toByteArray(), paddingNum);
+  }
+
+  private HilbertCurveUtils() {}
+
+  public static byte[] paddingToNByte(byte[] data, int paddingNum) {

Review Comment:
   I neglected, we can reuse.thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1313119358

   @RussellSpitzer: Would you like to take a look at this? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1254656452

   > The data aggregation of Hilbert curve is better, this pr will support.
   
   Can we take a look at it,thanks! @ajantha-bhat @RussellSpitzer @rdblue @kbendick


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] minjing0824 commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
minjing0824 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1328098037

   Hi, @linfey90 I have a question want to know, here how I use zorder in my code
   `actions().rewriteDataFiles(table).zOrder(columns).option("xxx","xxx")`
   if you add hilbert, it only to change zOrder to hilbert to use it like
   `actions().rewriteDataFiles(table).HilbertCurve(columns).options("xxx","xxx")` or something else? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021624111


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveUDF.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.types.DataType;
+
+public interface SparkSpaceCurveUDF {
+
+  Column sortedLexicographically(Column column, DataType type);
+
+  Column interleaveBytes(Column arrayBinary);

Review Comment:
   This I think is a little incorrect, the ZOrder algorithm interleaves the Hilbert doesn't quite do that. Maybe this should just be combineColumns or something like that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] XuQianJin-Stars commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
XuQianJin-Stars commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1255722391

   hi @linfey90 Overall implementation is good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021617890


##########
spark/v3.3/build.gradle:
##########
@@ -39,6 +39,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
   }
 
   dependencies {
+    implementation group: 'com.github.davidmoten', name: 'hilbert-curve', version: '0.2.2'

Review Comment:
   What is the licensing info on this library? Just want to make sure it is allowed to be included in the project?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1024671032


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveUDF.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.types.DataType;
+
+public interface SparkSpaceCurveUDF {
+
+  Column sortedLexicographically(Column column, DataType type);
+
+  Column interleaveBytes(Column arrayBinary);

Review Comment:
   thanks.I'll take a closer look later.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022369675


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveUDF.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.types.DataType;
+
+public interface SparkSpaceCurveUDF {
+
+  Column sortedLexicographically(Column column, DataType type);
+
+  Column interleaveBytes(Column arrayBinary);

Review Comment:
   Thank you, but I don't understand the meaning here. I think the hilbert value is generated, isn't it hilbert curve?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1314840245

   > 
   Ok, I'll try to do that later. Thank you for your reply.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1037956544


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveUDF.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.types.DataType;
+
+public interface SparkSpaceCurveUDF {
+
+  Column sortedLexicographically(Column column, DataType type);
+
+  Column interleaveBytes(Column arrayBinary);

Review Comment:
   > For ZOrder bytes are directly interleaved, AAAA and BBBB -> ABABABA
   > 
   > For Hilbert it's give coordinates AAAA and BBBB what is the corresponding hilbert curve location given a hilbert curve of X bit depth. So that isn't an interleaving.
   
   I take a look, this method name caused a misunderstanding, in fact, it did not interleaving for hilbert.I'll update it;



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1331577709

   > 
   
   yes,that's right.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1023195403


##########
spark/v3.3/build.gradle:
##########
@@ -39,6 +39,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
   }
 
   dependencies {
+    implementation group: 'com.github.davidmoten', name: 'hilbert-curve', version: '0.2.2'

Review Comment:
   Maybe that's ok then? I just want to make sure we do it right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021091018


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveStrategy.java:
##########
@@ -57,8 +57,8 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-public class SparkZOrderStrategy extends SparkSortStrategy {

Review Comment:
   Maybe we cannot rename the public interfaces. It may break compatibility.
   We need to deprecate first and provide an alternate implementation. Later remove the deprecated class in the next version.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021629682


##########
spark/v3.3/build.gradle:
##########
@@ -39,6 +39,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
   }
 
   dependencies {
+    implementation group: 'com.github.davidmoten', name: 'hilbert-curve', version: '0.2.2'

Review Comment:
   I checked this while reviewing. It is Apache-2.0 and is compatible with the Iceberg project licensing.
   https://github.com/davidmoten/hilbert-curve



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#issuecomment-1366402542

   Hilbert  has better data aggregation.Here is a simple performance test.
   1: prepare a parquet table which has One hundred million rows, and 11 columns.
   and has two column name c1 and c2.the values is range from 0 to 500000.
   the flinksql like,
   CREATE TABLE default_catalog.default_database.dg (
       c1 INT,
   	c2 bigint
   	c3 VARCHAR,
   	c4 VARCHAR,
   	c5 TINYINT,
   	c6 SMALLINT,
   	c7 FLOAT,
   	c8 double,
   	c9 char,
   	c10 boolean,
   	c11 AS localtimestamp
   ) WITH (
       'connector' = 'datagen',
   	'fields.c3.length' = '10',
   	'fields.c4.length' = '10',
   	'fields.c1.min' = '0',
   	'fields.c1.max' = '1000000',
   	'fields.c2.min' = '0',
   	'fields.c2.max' = '1000000',
       'rows-per-second' = '30000',
   	'number-of-rows' ='100000000'
   );
   2: Create two tables, test_zorder and test_hilbert, and copy the above data.
   3: rewrite the table by sort c1,c2 with zorder and hilbert.
   4: Write code to view the number of file skips, and execute the sql like select count.
   |query condition         | table           | file skip  | total Files | file Skip percentage  | query time |
   | ------------- |:-------------:| -----:| -----:| -----:| -----:|
   | c1 <500000 and c2 < 500000      | hilbert | 97 | 171 | 56.7% | 1.018s |
   |       | zorder      |   82 | 180 | 45.56% | 1.353s |
   | c1 >500000 and c2 > 500000 | hilbert | 28 | 171 | 16.37% | 3.337s |
   |  | zorder | 18 | 180 | 10% | 3.37s |
   
   note:The query time depends on the cluster environment and is for reference only. But file skip is stable.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022346696


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveStrategy.java:
##########
@@ -57,8 +57,8 @@
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-public class SparkZOrderStrategy extends SparkSortStrategy {

Review Comment:
   get it,thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021622062


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkHilbertUDF.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.Serializable;
+import java.math.BigDecimal;
+import org.apache.iceberg.util.BinaryUtil;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.expressions.UserDefinedFunction;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.BinaryType;
+import org.apache.spark.sql.types.BooleanType;
+import org.apache.spark.sql.types.ByteType;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.DateType;
+import org.apache.spark.sql.types.DecimalType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.FloatType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.LongType;
+import org.apache.spark.sql.types.ShortType;
+import org.apache.spark.sql.types.StringType;
+import org.apache.spark.sql.types.TimestampType;
+import org.davidmoten.hilbert.HilbertCurve;
+import scala.collection.JavaConverters;
+import scala.collection.Seq;
+
+class SparkHilbertUDF implements SparkSpaceCurveUDF, Serializable {
+  private static final long PRIMITIVE_EMPTY = Long.MAX_VALUE;
+
+  private static final int BITS_NUM = 63;
+
+  SparkHilbertUDF() {}
+
+  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
+    in.defaultReadObject();
+  }
+
+  byte[] hilbertCurvePosBytes(Seq<Long> points) {
+    java.util.List<Long> longs = JavaConverters.seqAsJavaList(points);
+    long[] longs1 = new long[longs.size()];
+    for (int i = 0; i < longs.size(); i++) {
+      longs1[i] = longs.get(i);
+    }
+    HilbertCurve hilbertCurve = HilbertCurve.bits(BITS_NUM).dimensions(points.size());
+    return HilbertCurveUtils.indexBytes(hilbertCurve, longs1, BITS_NUM);
+  }
+
+  private UserDefinedFunction tinyToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Byte value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return BinaryUtil.convertBytesToLong(new byte[] {value});
+                },
+                DataTypes.LongType)
+            .withName("TINY_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction shortToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Short value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return (long) value;
+                },
+                DataTypes.LongType)
+            .withName("SHORT_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction intToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Integer value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return (long) value;
+                },
+                DataTypes.LongType)
+            .withName("INT_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction longToOrderedLongUDF() {

Review Comment:
   These are all the same as the Zorder functions correct? Can we unify the byte transformation udfs?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1021619347


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/HilbertCurveUtils.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import java.math.BigInteger;
+import org.davidmoten.hilbert.HilbertCurve;
+
+/** Utils for Hilbert Curve. */
+public class HilbertCurveUtils {
+  public static byte[] indexBytes(HilbertCurve hilbertCurve, long[] points, int paddingNum) {
+    BigInteger index = hilbertCurve.index(points);
+    return paddingToNByte(index.toByteArray(), paddingNum);
+  }
+
+  private HilbertCurveUtils() {}
+
+  public static byte[] paddingToNByte(byte[] data, int paddingNum) {

Review Comment:
   There is the specific 8 byte implementation above, do we need both?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] linfey90 commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
linfey90 commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022369313


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkHilbertUDF.java:
##########
@@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.Serializable;
+import java.math.BigDecimal;
+import org.apache.iceberg.util.BinaryUtil;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.expressions.UserDefinedFunction;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.BinaryType;
+import org.apache.spark.sql.types.BooleanType;
+import org.apache.spark.sql.types.ByteType;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.DateType;
+import org.apache.spark.sql.types.DecimalType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.FloatType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.LongType;
+import org.apache.spark.sql.types.ShortType;
+import org.apache.spark.sql.types.StringType;
+import org.apache.spark.sql.types.TimestampType;
+import org.davidmoten.hilbert.HilbertCurve;
+import scala.collection.JavaConverters;
+import scala.collection.Seq;
+
+class SparkHilbertUDF implements SparkSpaceCurveUDF, Serializable {
+  private static final long PRIMITIVE_EMPTY = Long.MAX_VALUE;
+
+  private static final int BITS_NUM = 63;
+
+  SparkHilbertUDF() {}
+
+  private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
+    in.defaultReadObject();
+  }
+
+  byte[] hilbertCurvePosBytes(Seq<Long> points) {
+    java.util.List<Long> longs = JavaConverters.seqAsJavaList(points);
+    long[] longs1 = new long[longs.size()];
+    for (int i = 0; i < longs.size(); i++) {
+      longs1[i] = longs.get(i);
+    }
+    HilbertCurve hilbertCurve = HilbertCurve.bits(BITS_NUM).dimensions(points.size());
+    return HilbertCurveUtils.indexBytes(hilbertCurve, longs1, BITS_NUM);
+  }
+
+  private UserDefinedFunction tinyToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Byte value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return BinaryUtil.convertBytesToLong(new byte[] {value});
+                },
+                DataTypes.LongType)
+            .withName("TINY_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction shortToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Short value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return (long) value;
+                },
+                DataTypes.LongType)
+            .withName("SHORT_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction intToOrderedLongUDF() {
+    UserDefinedFunction udf =
+        functions
+            .udf(
+                (Integer value) -> {
+                  if (value == null) {
+                    return PRIMITIVE_EMPTY;
+                  }
+                  return (long) value;
+                },
+                DataTypes.LongType)
+            .withName("INT_ORDERED_BYTES");
+
+    return udf;
+  }
+
+  private UserDefinedFunction longToOrderedLongUDF() {

Review Comment:
   yes,we can, I thought if I converted to bytes, I'd have to go back to long.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] ajantha-bhat commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
ajantha-bhat commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022971182


##########
spark/v3.3/build.gradle:
##########
@@ -39,6 +39,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
   }
 
   dependencies {
+    implementation group: 'com.github.davidmoten', name: 'hilbert-curve', version: '0.2.2'

Review Comment:
   Interesting. TIL.
   
   I thought only if we keep their source code in our code, we need to add in [LICENCE](https://github.com/apache/iceberg/blob/master/LICENSE)
   
   We do depend on a lot of third-party libraries and I don't see them in LICENSE. Do you know why?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1022950032


##########
spark/v3.3/build.gradle:
##########
@@ -39,6 +39,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
   }
 
   dependencies {
+    implementation group: 'com.github.davidmoten', name: 'hilbert-curve', version: '0.2.2'

Review Comment:
   It needs to be added to LICENSE



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #5824: Spark: support hilbert curve when rewrite

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on code in PR #5824:
URL: https://github.com/apache/iceberg/pull/5824#discussion_r1023194843


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/actions/SparkSpaceCurveUDF.java:
##########
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.spark.actions;
+
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.types.DataType;
+
+public interface SparkSpaceCurveUDF {
+
+  Column sortedLexicographically(Column column, DataType type);
+
+  Column interleaveBytes(Column arrayBinary);

Review Comment:
   For ZOrder bytes are directly interleaved, AAAA and BBBB -> ABABABA
   
   For Hilbert it's give coordinates AAAA and BBBB what is the corresponding hilbert curve location given a hilbert curve of X bit depth. So that isn't an interleaving. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org