You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/10/22 13:52:07 UTC

[GitHub] [hudi] wangxianghu opened a new pull request #2200: [HUDI-912][WIP]Refactor and relocate KeyGenerator to support more engines

wangxianghu opened a new pull request #2200:
URL: https://github.com/apache/hudi/pull/2200


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *Refactor and relocate KeyGenerator to support more engines*
   
   ## Brief change log
   
   *Currently, `keyGenerator`s are implemented in `hudi-spark` module,  they can only be used by spark engine.
   
   Since `keyGenerator` is a core tool for hudi, they should be engine-independent.*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] leesf commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r513483904



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SparkCustomKeyGenerator.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.exception.HoodieKeyException;
+import org.apache.hudi.keygen.common.KeyGeneratorOptions;
+import org.apache.spark.sql.Row;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.stream.Collectors;
+
+/**
+ * This is a generic implementation of KeyGenerator where users can configure record key as a single field or a combination of fields. Similarly partition path can be configured to have multiple
+ * fields or only one field. This class expects value for prop "hoodie.datasource.write.partitionpath.field" in a specific format. For example:
+ *
+ * properties.put("hoodie.datasource.write.partitionpath.field", "field1:PartitionKeyType1,field2:PartitionKeyType2").
+ *
+ * The complete partition path is created as <value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2> and so on.
+ *
+ * Few points to consider: 1. If you want to customize some partition path field on a timestamp basis, you can use field1:timestampBased 2. If you simply want to have the value of your configured
+ * field in the partition path, use field1:simple 3. If you want your table to be non partitioned, simply leave it as blank.
+ *
+ * RecordKey is internally generated using either SimpleKeyGenerator or ComplexKeyGenerator.
+ */
+public class SparkCustomKeyGenerator extends BaseSparkKeyGenerator {
+
+  private CustomKeyGenerator customKeyGenerator;
+
+  public SparkCustomKeyGenerator(TypedProperties props) {
+    super(props);
+    this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY).split(",")).map(String::trim).collect(Collectors.toList());
+    this.partitionPathFields = Arrays.stream(props.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY).split(",")).map(String::trim).collect(Collectors.toList());
+    customKeyGenerator = new CustomKeyGenerator(props);
+  }
+
+  @Override
+  public String getRecordKey(GenericRecord record) {
+    return customKeyGenerator.getRecordKey(record);
+  }
+
+  @Override
+  public String getPartitionPath(GenericRecord record) {
+    return customKeyGenerator.getPartitionPath(record);
+  }
+
+  @Override
+  public String getRecordKey(Row row) {
+    validateRecordKeyFields();
+    return getRecordKeyFields().size() == 1
+        ? new SparkSimpleKeyGenerator(config).getRecordKey(row)
+        : new SparkComplexKeyGenerator(config).getRecordKey(row);
+  }
+
+  @Override
+  public String getPartitionPath(Row row) {
+    return getPartitionPath(Option.empty(), Option.of(row));
+  }
+
+  private String getPartitionPath(Option<GenericRecord> record, Option<Row> row) {
+    if (getPartitionPathFields() == null) {
+      throw new HoodieKeyException("Unable to find field names for partition path in cfg");
+    }
+
+    String partitionPathField;
+    StringBuilder partitionPath = new StringBuilder();
+
+    //Corresponds to no partition case
+    if (getPartitionPathFields().size() == 1 && getPartitionPathFields().get(0).isEmpty()) {
+      return "";
+    }
+    for (String field : getPartitionPathFields()) {
+      String[] fieldWithType = field.split(customKeyGenerator.getSplitRegex());
+      if (fieldWithType.length != 2) {
+        throw new HoodieKeyException("Unable to find field names for partition path in proper format");
+      }
+
+      partitionPathField = fieldWithType[0];
+      CustomKeyGenerator.PartitionKeyType keyType = CustomKeyGenerator.PartitionKeyType.valueOf(fieldWithType[1].toUpperCase());
+      switch (keyType) {
+        case SIMPLE:
+          if (record.isPresent()) {
+            partitionPath.append(new SparkSimpleKeyGenerator(config, partitionPathField).getPartitionPath(record.get()));
+          } else {
+            partitionPath.append(new SparkSimpleKeyGenerator(config, partitionPathField).getPartitionPath(row.get()));
+          }
+          break;
+        case TIMESTAMP:
+          try {
+            if (record.isPresent()) {
+              partitionPath.append(new SparkTimestampBasedKeyGenerator(config, partitionPathField).getPartitionPath(record.get()));
+            } else {
+              partitionPath.append(new SparkTimestampBasedKeyGenerator(config, partitionPathField).getPartitionPath(row.get()));
+            }

Review comment:
       looks like the above logic move to this place.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] yanghua commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
yanghua commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-716324906


   @wangxianghu Did you check if this PR changes some generator classes' package name? Some classes may exist in the doc.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r516256499



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java
##########
@@ -18,38 +17,38 @@
 
 package org.apache.hudi.keygen;
 
-import org.apache.hudi.DataSourceWriteOptions;
-import org.apache.hudi.common.config.TypedProperties;
-
 import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.spark.sql.Row;
 
 import java.util.Arrays;
 import java.util.stream.Collectors;
-import org.apache.spark.sql.Row;
 
 /**
  * Complex key generator, which takes names of fields to be used for recordKey and partitionPath as configs.
  */
 public class ComplexKeyGenerator extends BuiltinKeyGenerator {
 
-  public static final String DEFAULT_RECORD_KEY_SEPARATOR = ":";
+  private CommonComplexKeyGenerator commonComplexKeyGenerator;
 
   public ComplexKeyGenerator(TypedProperties props) {
     super(props);
-    this.recordKeyFields = Arrays.stream(props.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY())
+    this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY)

Review comment:
       ah ok. no problem. this makes sense 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r514950337



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/CommonComplexKeyGenerator.java
##########
@@ -0,0 +1,50 @@
+/*

Review comment:
       Those `CommonKeyGenerator`s can be used for both `flink` and `java` engine. 
   I will add some flink specific `KeyGenerator`s  when flink implementation(https://github.com/apache/hudi/pull/2176) is  landed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-716084752


   > @wangxianghu High level question before I start reviewing in depth.
   > 
   > Does this change break existing custom KeyGenerator implementations.
   
   it is the same as before.
   just extract a spark-independent version, wrap it as a spark one. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar merged pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar merged pull request #2200:
URL: https://github.com/apache/hudi/pull/2200


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r514763942



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/BootstrapCommand.java
##########
@@ -68,7 +68,7 @@ public String bootstrap(
           help = "Bootstrap Index Class") final String bootstrapIndexClass,
       @CliOption(key = {"selectorClass"}, unspecifiedDefaultValue = "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector",
           help = "Selector class for bootstrap") final String selectorClass,
-      @CliOption(key = {"keyGeneratorClass"}, unspecifiedDefaultValue = "org.apache.hudi.keygen.SimpleKeyGenerator",
+      @CliOption(key = {"keyGeneratorClass"}, unspecifiedDefaultValue = "org.apache.hudi.keygen.SparkSimpleKeyGenerator",

Review comment:
       This is what I meant in the first comment. we have to be backwards compatible with existing clients. It will be disruptive change for everyone to reconfigure all their jobs. Can we avoid renaming of keygenerator classes or changing their package names




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] leesf commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r513480391



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java
##########
@@ -94,27 +82,18 @@ private String getPartitionPath(Option<GenericRecord> record, Option<Row> row) {
       PartitionKeyType keyType = PartitionKeyType.valueOf(fieldWithType[1].toUpperCase());
       switch (keyType) {
         case SIMPLE:
-          if (record.isPresent()) {
-            partitionPath.append(new SimpleKeyGenerator(config, partitionPathField).getPartitionPath(record.get()));
-          } else {
-            partitionPath.append(new SimpleKeyGenerator(config, partitionPathField).getPartitionPath(row.get()));
-          }
+          partitionPath.append(new SimpleKeyGenerator(config, partitionPathField).getPartitionPath(record));

Review comment:
       would this change the above behavior? above would also get `row.get`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-719464715


   @vinothchandar this pr is ready for review now


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-719242232


   For context, we did change this once, and users did not like it (undertstandably) that all of their jobs needed to change
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu edited a comment on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu edited a comment on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-716084752


   > @wangxianghu High level question before I start reviewing in depth.
   > 
   > Does this change break existing custom KeyGenerator implementations.
   
   @vinothchandar 
   it is the same as before.
   just extract a spark-independent version, wrap it as a spark one. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-716057557


   @wangxianghu High level question before I start reviewing in depth.
   
   Does this change break existing custom KeyGenerator implementations. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r514782542



##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/BootstrapCommand.java
##########
@@ -68,7 +68,7 @@ public String bootstrap(
           help = "Bootstrap Index Class") final String bootstrapIndexClass,
       @CliOption(key = {"selectorClass"}, unspecifiedDefaultValue = "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector",
           help = "Selector class for bootstrap") final String selectorClass,
-      @CliOption(key = {"keyGeneratorClass"}, unspecifiedDefaultValue = "org.apache.hudi.keygen.SimpleKeyGenerator",
+      @CliOption(key = {"keyGeneratorClass"}, unspecifiedDefaultValue = "org.apache.hudi.keygen.SparkSimpleKeyGenerator",

Review comment:
       > This is what I meant in the first comment. we have to be backwards compatible with existing clients. It will be disruptive change for everyone to reconfigure all their jobs. Can we avoid renaming of keygenerator classes or changing their package names
   
   got it. 
   for speak, we leave the ket generator names untouched and rename it for other engines.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-715661099


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=h1) Report
   > Merging [#2200](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/14c4611857ea796b8d3aef96f867db9cd0ae31f7?el=desc) will **decrease** coverage by `0.54%`.
   > The diff coverage is `91.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2200/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2200      +/-   ##
   ============================================
   - Coverage     53.68%   53.13%   -0.55%     
   + Complexity     2849     2700     -149     
   ============================================
     Files           359      344      -15     
     Lines         16565    15885     -680     
     Branches       1782     1633     -149     
   ============================================
   - Hits           8893     8441     -452     
   + Misses         6915     6756     -159     
   + Partials        757      688      -69     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
   | #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | #hudicommon | `54.66% <ø> (-0.04%)` | `1794.00 <ø> (ø)` | |
   | #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
   | #hudispark | `65.91% <91.66%> (-0.04%)` | `155.00 <0.00> (-149.00)` | |
   | #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
   | #hudiutilities | `70.09% <ø> (ø)` | `327.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...org/apache/hudi/cli/commands/BootstrapCommand.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0Jvb3RzdHJhcENvbW1hbmQuamF2YQ==) | `1.78% <ø> (ø)` | `1.00 <0.00> (ø)` | |
   | [...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllQm9vdHN0cmFwUmVsYXRpb24uc2NhbGE=) | `87.20% <ø> (ø)` | `15.00 <0.00> (ø)` | |
   | [...n/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllTWVyZ2VPblJlYWRSREQuc2NhbGE=) | `82.19% <ø> (ø)` | `10.00 <0.00> (ø)` | |
   | [...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==) | `76.19% <ø> (ø)` | `20.00 <0.00> (ø)` | |
   | [.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvTWVyZ2VPblJlYWRTbmFwc2hvdFJlbGF0aW9uLnNjYWxh) | `90.80% <ø> (ø)` | `16.00 <0.00> (ø)` | |
   | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `64.19% <ø> (ø)` | `30.00 <0.00> (ø)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.54% <ø> (ø)` | `49.00 <0.00> (ø)` | |
   | [...i/utilities/deltastreamer/SourceFormatAdapter.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvU291cmNlRm9ybWF0QWRhcHRlci5qYXZh) | `86.48% <ø> (ø)` | `11.00 <0.00> (ø)` | |
   | [.../hudi/utilities/schema/RowBasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9Sb3dCYXNlZFNjaGVtYVByb3ZpZGVyLmphdmE=) | `66.66% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `50.95% <50.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | ... and [6 more](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-716330903


   > @wangxianghu Did you check if this PR changes some generator classes' package name? Some classes may exist in the doc.
   
   @yanghua yes, it's changed.
   I have filed a ticket to update the doc(https://issues.apache.org/jira/browse/HUDI-913),  will push it after this pr landed


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codecov-io edited a comment on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-715661099


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=h1) Report
   > Merging [#2200](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/14c4611857ea796b8d3aef96f867db9cd0ae31f7?el=desc) will **decrease** coverage by `0.54%`.
   > The diff coverage is `91.66%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2200/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2200      +/-   ##
   ============================================
   - Coverage     53.68%   53.13%   -0.55%     
   + Complexity     2849     2700     -149     
   ============================================
     Files           359      344      -15     
     Lines         16565    15885     -680     
     Branches       1782     1633     -149     
   ============================================
   - Hits           8893     8441     -452     
   + Misses         6915     6756     -159     
   + Partials        757      688      -69     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
   | #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | #hudicommon | `54.66% <ø> (-0.04%)` | `1794.00 <ø> (ø)` | |
   | #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
   | #hudispark | `65.91% <91.66%> (-0.04%)` | `155.00 <0.00> (-149.00)` | |
   | #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
   | #hudiutilities | `70.09% <ø> (ø)` | `327.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2200?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...org/apache/hudi/cli/commands/BootstrapCommand.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL0Jvb3RzdHJhcENvbW1hbmQuamF2YQ==) | `1.78% <ø> (ø)` | `1.00 <0.00> (ø)` | |
   | [...cala/org/apache/hudi/HoodieBootstrapRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllQm9vdHN0cmFwUmVsYXRpb24uc2NhbGE=) | `87.20% <ø> (ø)` | `15.00 <0.00> (ø)` | |
   | [...n/scala/org/apache/hudi/HoodieMergeOnReadRDD.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllTWVyZ2VPblJlYWRSREQuc2NhbGE=) | `82.19% <ø> (ø)` | `10.00 <0.00> (ø)` | |
   | [...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==) | `76.19% <ø> (ø)` | `20.00 <0.00> (ø)` | |
   | [.../org/apache/hudi/MergeOnReadSnapshotRelation.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvTWVyZ2VPblJlYWRTbmFwc2hvdFJlbGF0aW9uLnNjYWxh) | `90.80% <ø> (ø)` | `16.00 <0.00> (ø)` | |
   | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `64.19% <ø> (ø)` | `30.00 <0.00> (ø)` | |
   | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.54% <ø> (ø)` | `49.00 <0.00> (ø)` | |
   | [...i/utilities/deltastreamer/SourceFormatAdapter.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvU291cmNlRm9ybWF0QWRhcHRlci5qYXZh) | `86.48% <ø> (ø)` | `11.00 <0.00> (ø)` | |
   | [.../hudi/utilities/schema/RowBasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9Sb3dCYXNlZFNjaGVtYVByb3ZpZGVyLmphdmE=) | `66.66% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `50.95% <50.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | ... and [6 more](https://codecov.io/gh/apache/hudi/pull/2200/diff?src=pr&el=tree-more) | |
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] leesf commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
leesf commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r513482748



##########
File path: hudi-client/hudi-spark-client/pom.xml
##########
@@ -31,6 +31,13 @@
   <packaging>jar</packaging>
 
   <dependencies>
+    <!-- Scala -->
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <version>${scala.version}</version>
+    </dependency>
+

Review comment:
       does it exist before?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r513856075



##########
File path: hudi-client/hudi-spark-client/pom.xml
##########
@@ -31,6 +31,13 @@
   <packaging>jar</packaging>
 
   <dependencies>
+    <!-- Scala -->
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <version>${scala.version}</version>
+    </dependency>
+

Review comment:
       > does it exist before?
   no, I moved some utils(`AvroConversionHelper`,`AvroConversionUtils`) from `hudi-spark`, they are implemented in scala.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-719241457


   @wangxianghu yes we can call the Flink ones, `FlinkSimpleKeyGenerator` or put it under a `org.apache.hudi.keygen.flink.` package


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r513856697



##########
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/SparkCustomKeyGenerator.java
##########
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.exception.HoodieKeyException;
+import org.apache.hudi.keygen.common.KeyGeneratorOptions;
+import org.apache.spark.sql.Row;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.stream.Collectors;
+
+/**
+ * This is a generic implementation of KeyGenerator where users can configure record key as a single field or a combination of fields. Similarly partition path can be configured to have multiple
+ * fields or only one field. This class expects value for prop "hoodie.datasource.write.partitionpath.field" in a specific format. For example:
+ *
+ * properties.put("hoodie.datasource.write.partitionpath.field", "field1:PartitionKeyType1,field2:PartitionKeyType2").
+ *
+ * The complete partition path is created as <value for field1 basis PartitionKeyType1>/<value for field2 basis PartitionKeyType2> and so on.
+ *
+ * Few points to consider: 1. If you want to customize some partition path field on a timestamp basis, you can use field1:timestampBased 2. If you simply want to have the value of your configured
+ * field in the partition path, use field1:simple 3. If you want your table to be non partitioned, simply leave it as blank.
+ *
+ * RecordKey is internally generated using either SimpleKeyGenerator or ComplexKeyGenerator.
+ */
+public class SparkCustomKeyGenerator extends BaseSparkKeyGenerator {
+
+  private CustomKeyGenerator customKeyGenerator;
+
+  public SparkCustomKeyGenerator(TypedProperties props) {
+    super(props);
+    this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_OPT_KEY).split(",")).map(String::trim).collect(Collectors.toList());
+    this.partitionPathFields = Arrays.stream(props.getString(KeyGeneratorOptions.PARTITIONPATH_FIELD_OPT_KEY).split(",")).map(String::trim).collect(Collectors.toList());
+    customKeyGenerator = new CustomKeyGenerator(props);
+  }
+
+  @Override
+  public String getRecordKey(GenericRecord record) {
+    return customKeyGenerator.getRecordKey(record);
+  }
+
+  @Override
+  public String getPartitionPath(GenericRecord record) {
+    return customKeyGenerator.getPartitionPath(record);
+  }
+
+  @Override
+  public String getRecordKey(Row row) {
+    validateRecordKeyFields();
+    return getRecordKeyFields().size() == 1
+        ? new SparkSimpleKeyGenerator(config).getRecordKey(row)
+        : new SparkComplexKeyGenerator(config).getRecordKey(row);
+  }
+
+  @Override
+  public String getPartitionPath(Row row) {
+    return getPartitionPath(Option.empty(), Option.of(row));
+  }
+
+  private String getPartitionPath(Option<GenericRecord> record, Option<Row> row) {
+    if (getPartitionPathFields() == null) {
+      throw new HoodieKeyException("Unable to find field names for partition path in cfg");
+    }
+
+    String partitionPathField;
+    StringBuilder partitionPath = new StringBuilder();
+
+    //Corresponds to no partition case
+    if (getPartitionPathFields().size() == 1 && getPartitionPathFields().get(0).isEmpty()) {
+      return "";
+    }
+    for (String field : getPartitionPathFields()) {
+      String[] fieldWithType = field.split(customKeyGenerator.getSplitRegex());
+      if (fieldWithType.length != 2) {
+        throw new HoodieKeyException("Unable to find field names for partition path in proper format");
+      }
+
+      partitionPathField = fieldWithType[0];
+      CustomKeyGenerator.PartitionKeyType keyType = CustomKeyGenerator.PartitionKeyType.valueOf(fieldWithType[1].toUpperCase());
+      switch (keyType) {
+        case SIMPLE:
+          if (record.isPresent()) {
+            partitionPath.append(new SparkSimpleKeyGenerator(config, partitionPathField).getPartitionPath(record.get()));
+          } else {
+            partitionPath.append(new SparkSimpleKeyGenerator(config, partitionPathField).getPartitionPath(row.get()));
+          }
+          break;
+        case TIMESTAMP:
+          try {
+            if (record.isPresent()) {
+              partitionPath.append(new SparkTimestampBasedKeyGenerator(config, partitionPathField).getPartitionPath(record.get()));
+            } else {
+              partitionPath.append(new SparkTimestampBasedKeyGenerator(config, partitionPathField).getPartitionPath(row.get()));
+            }

Review comment:
       > looks like the above logic move to this place.
   
   yes, moved here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-718687816


   Hi @vinothchandar, please help review, this pr blocks https://github.com/apache/hudi/pull/2176  cc  @yanghua @leesf 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on a change in pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on a change in pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#discussion_r514950337



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/CommonComplexKeyGenerator.java
##########
@@ -0,0 +1,50 @@
+/*

Review comment:
       Those `CommonKeyGenerator`s can be used for both `flink` and `java` engine. 
   I will add some flink specific `KeyGenerator`s  when flink implementation is  landed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-715381186


   @yanghua please take a look when free


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] wangxianghu commented on pull request #2200: [HUDI-912]Refactor and relocate KeyGenerator to support more engines

Posted by GitBox <gi...@apache.org>.
wangxianghu commented on pull request #2200:
URL: https://github.com/apache/hudi/pull/2200#issuecomment-719248711


   > For context, we did change this once, and users did not like it (undertstandably) that all of their jobs needed to change
   
   Yes,I get your point
   working on it now  :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org