Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/26 18:26:46 UTC

[GitHub] [hudi] yihua commented on a change in pull request #3952: [HUDI-2102]support hilbert curve for hudi.

yihua commented on a change in pull request #3952:
URL: https://github.com/apache/hudi/pull/3952#discussion_r757629889



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/HilbertCurve.java
##########
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.optimize;
+
+import java.math.BigInteger;
+import java.util.Arrays;
+
+/**
+ * Converts between Hilbert index ({@code BigInteger}) and N-dimensional points.
+ *
+ * Note:
+ * <a href="https://github.com/davidmoten/hilbert-curve/blob/master/src/main/java/org/davidmoten/hilbert/HilbertCurve.java">GitHub</a>).
+ * the Licensed of above link is also http://www.apache.org/licenses/LICENSE-2.0
+ */
+public final class HilbertCurve {

Review comment:
       Is this class copied from https://github.com/davidmoten/hilbert-curve/blob/master/src/main/java/org/davidmoten/hilbert/HilbertCurve.java?  Could we just add that library as a dependency and have a wrapper class around it if needed?
   ```
   <dependency>
       <groupId>com.github.davidmoten</groupId>
       <artifactId>hilbert-curve</artifactId>
       <version>VERSION_HERE</version>
   </dependency>
   ```

##########
File path: hudi-client/hudi-client-common/src/test/java/org/apache/hudi/optimize/TestZOrderingUtil.java
##########
@@ -126,4 +126,21 @@ public OrginValueWrapper(T index, T originValue) {
       this.originValue = originValue;
     }
   }
+
+  @Test
+  public void testConvertBytesToLong() {

Review comment:
       Could you add another test for the case where the byte array passed to `convertBytesToLong()` is not 8 bytes long, so that the padding logic is exercised?
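
A rough sketch of what such a test could check, written as a plain Java class so it runs on its own. Both helpers below are hypothetical stand-ins, not the real `ZOrderingUtil` methods: they assume that short inputs are left-padded with zero bytes, that inputs longer than 8 bytes keep only their first 8 bytes, and that the result is read big-endian. The expected values would need adjusting if the actual padding behaves differently.

```java
import java.nio.ByteBuffer;

public class ConvertBytesToLongPaddingSketch {

  // Hypothetical stand-in for ZOrderingUtil.paddingTo8Byte().
  // Assumption: left-pad short inputs with zeros, keep the first 8 bytes of long inputs.
  public static byte[] paddingTo8Byte(byte[] bytes) {
    if (bytes.length == 8) {
      return bytes;
    }
    byte[] result = new byte[8];
    if (bytes.length > 8) {
      System.arraycopy(bytes, 0, result, 0, 8);
    } else {
      System.arraycopy(bytes, 0, result, 8 - bytes.length, bytes.length);
    }
    return result;
  }

  // Hypothetical stand-in for ZOrderingUtil.convertBytesToLong().
  // Assumption: the 8 padded bytes are interpreted as a big-endian long.
  public static long convertBytesToLong(byte[] bytes) {
    return ByteBuffer.wrap(paddingTo8Byte(bytes)).getLong();
  }

  private static void check(boolean condition, String caseName) {
    if (!condition) {
      throw new AssertionError("failed case: " + caseName);
    }
  }

  public static void main(String[] args) {
    // Shorter than 8 bytes: padding is applied on the left.
    check(convertBytesToLong(new byte[] {0x01, 0x02}) == 0x0102L, "two bytes");
    // Empty input: all 8 bytes come from padding.
    check(convertBytesToLong(new byte[0]) == 0L, "empty");
    // Longer than 8 bytes: the ninth byte is dropped under the stated assumption.
    check(convertBytesToLong(new byte[] {0, 0, 0, 0, 0, 0, 0, 1, 0x7f}) == 1L, "nine bytes");
    // Exactly 8 bytes: no padding involved.
    check(convertBytesToLong(new byte[] {0, 0, 0, 0, 0, 0, 0, 5}) == 5L, "eight bytes");
    System.out.println("all padding cases passed");
  }
}
```

In the PR itself these cases would presumably become JUnit assertions against `ZOrderingUtil.convertBytesToLong()` in `TestZOrderingUtil`.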

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/ZOrderingUtil.java
##########
@@ -176,9 +176,17 @@ public static byte updatePos(byte a, int apos, byte b, int bpos) {
 
   public static Long convertStringToLong(String a) {
     byte[] bytes = utf8To8Byte(a);
+    return convertBytesToLong(bytes);
+  }
+
+  public static long convertBytesToLong(byte[] bytes) {
+    byte[] padBytes = bytes;

Review comment:
       nit: this could be named `paddedBytes`

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/spark/SpaceCurveOptimizeHelper.java
##########
@@ -67,40 +69,62 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collection;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-public class ZCurveOptimizeHelper {
+public class SpaceCurveOptimizeHelper {
 
   private static final String SPARK_JOB_DESCRIPTION = "spark.job.description";
 
   /**
-   * Create z-order DataFrame directly
-   * first, map all base type data to byte[8], then create z-order DataFrame
+   * Create optimized DataFrame directly
    * only support base type data. long,int,short,double,float,string,timestamp,decimal,date,byte
-   * this method is more effective than createZIndexDataFrameBySample
+   * this method is more effective than createOptimizeDataFrameBySample
    *
    * @param df a spark DataFrame holds parquet files to be read.
-   * @param zCols z-sort cols
+   * @param sortCols z-sort/hilbert-sort cols
    * @param fileNum spark partition num
-   * @return a dataFrame sorted by z-order.
+   * @param sortMode layout optimization strategy
+   * @return a dataFrame sorted by z-order/hilbert.

Review comment:
       Similar here: the description can stay general rather than naming the sorting mechanism (z-order/hilbert).

##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/spark/SpaceCurveOptimizeHelper.java
##########
@@ -67,40 +69,62 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collection;
+import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 import java.util.stream.Collectors;
 
-public class ZCurveOptimizeHelper {
+public class SpaceCurveOptimizeHelper {
 
   private static final String SPARK_JOB_DESCRIPTION = "spark.job.description";
 
   /**
-   * Create z-order DataFrame directly
-   * first, map all base type data to byte[8], then create z-order DataFrame
+   * Create optimized DataFrame directly
    * only support base type data. long,int,short,double,float,string,timestamp,decimal,date,byte
-   * this method is more effective than createZIndexDataFrameBySample
+   * this method is more effective than createOptimizeDataFrameBySample
    *
    * @param df a spark DataFrame holds parquet files to be read.
-   * @param zCols z-sort cols
+   * @param sortCols z-sort/hilbert-sort cols

Review comment:
       nit: `z-sort/hilbert-sort cols` -> `sorting columns`? (No need to mention the sorting mechanism here; keeping it general is enough.)

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/optimize/ZOrderingUtil.java
##########
@@ -176,9 +176,17 @@ public static byte updatePos(byte a, int apos, byte b, int bpos) {
 
   public static Long convertStringToLong(String a) {
     byte[] bytes = utf8To8Byte(a);
+    return convertBytesToLong(bytes);
+  }
+
+  public static long convertBytesToLong(byte[] bytes) {
+    byte[] padBytes = bytes;
+    if (bytes.length != 8) {
+      padBytes = paddingTo8Byte(bytes);
+    }

Review comment:
       You can simply write `byte[] paddedBytes = paddingTo8Byte(bytes);` since `paddingTo8Byte()` already checks the length internally.
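
To illustrate the suggestion, here is a minimal self-contained sketch of the simplified method. The `paddingTo8Byte()` body is a hypothetical stand-in (assumed to return the input untouched when it is already 8 bytes, left-pad shorter inputs with zeros, and keep the first 8 bytes of longer ones), and the big-endian conversion loop is likewise an assumption rather than the code in the PR.

```java
public class PaddedBytesSketch {

  // Hypothetical stand-in for paddingTo8Byte(); the length check lives here,
  // so callers do not need their own `bytes.length != 8` guard.
  public static byte[] paddingTo8Byte(byte[] bytes) {
    if (bytes.length == 8) {
      return bytes; // already 8 bytes, returned as-is
    }
    byte[] result = new byte[8];
    int copied = Math.min(bytes.length, 8);
    // Assumed behavior: short inputs are right-aligned (zeros on the left),
    // long inputs keep their first 8 bytes.
    System.arraycopy(bytes, 0, result, 8 - copied, copied);
    return result;
  }

  public static long convertBytesToLong(byte[] bytes) {
    // Unconditional call: the helper is a no-op for 8-byte input.
    byte[] paddedBytes = paddingTo8Byte(bytes);
    long value = 0L;
    for (byte b : paddedBytes) {
      // Shift in one byte at a time, most significant byte first.
      value = (value << 8) | (b & 0xffL);
    }
    return value;
  }

  public static void main(String[] args) {
    System.out.println(convertBytesToLong(new byte[] {0x01, 0x02}));
    System.out.println(convertBytesToLong(new byte[] {0, 0, 0, 0, 0, 0, 0, 5}));
  }
}
```

Dropping the guard in the caller keeps the length handling in one place, which also makes the padding path easier to cover with a single set of tests.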



