You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/01/06 22:07:29 UTC

[GitHub] [incubator-hudi] bschell opened a new pull request #1194: [HUDI-326] Add support to delete records with only record_key

bschell opened a new pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194
 
 
   Allows HoodieSparkSqlWriter to delete records from a dataframe containing only record keys when using global indices because delete does not use partition from global indices.
   
   ## What is the purpose of the pull request
   
   This pull request allows Hudi deletes with only record_key for global indices. I don't see any way to avoid this because keyGenerators will always fail if the partition value field is not included. Having an API with delete(List<RecordKey> deleteKeys) will not be sufficient either as any keyGenerator transformations on the record_key would not happen then. This works around this by allowing the keyGenerators to always pass.
   
   ## Verify this pull request
   
   This pull request has minimal changes. The changes have been manually tested and verified. 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#issuecomment-571389349
 
 
   Also, the tests seem to be failing, could you take a look ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363554374
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   There is a NonPartitionedKeyGenerator that you can pass and use for deletions. Since as a client you are aware of 2 facts a) Either you are using GLOBAL_INDEX or not b) If this batch is for Deletions, you can configure the hudi client accordingly with a correct partition generator -> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java
   May be we can extend this class or rename it to reflect the correct key generation strategy for deletions to be explicit 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r366088953
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java
 ##########
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.keygen;
 
 Review comment:
   This is a backwards incompatible change. Users would have custom key generators using configuration : DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()
   
   It makes sense to move to separate package but we need to call out the change in release notes . Please open a tracking ticket to update release notes for this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365046553
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
 
 Review comment:
   If the recordkey is "null" what is the advantage of creating a record with this string ? There wouldn't be any records with that value in the first place right ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365063148
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Actually do you want to remove the distinction between null/empty? I would rather keep it as I have seen users with both null and empty columns.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365045991
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Can you use this : https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java#L67 ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#issuecomment-572716995
 
 
   @n3nash just updated the pull request to a new GlobalDelete key generator and the tests pass locally for me.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365478024
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Interesting, I think we should fix the ComplexKeyGenerator as well, writing extra bytes like "__null__" just to denote null values doesn't seem like the best idea for data lakes. 
   Can you open a ticket for addressing that ? We can for now keep this implementation consistent with ComplexKeyGenerator. Once you open the ticket, please link it here and squash your commits to 1 then I can merge this PR, thanks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365062782
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
 
 Review comment:
   It is for the case when the record_key is a composite key comprised of 2 different columns. One could be null while the other has a value.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365478024
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Interesting, I think we should fix the ComplexKeyGenerator as well, writing extra bytes like "__null__" just to denote null values doesn't seem like the best idea for data lakes. 
   Can you open a ticket for addressing that ? We can for now keep this implementation consistent with ComplexKeyGenerator.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bvaradar commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r366091392
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
 
 Review comment:
   @bschell : The only difference between GlobalDeleteKeyGenerator and ComplexKeyGenerator is that the former always creates an empty-partition path. right ? In that case, can we simply refactor the getKey() method in ComplexKeyGenerator and have GlobalDeleteKeyGenerator extend ComplexKeyGenerator with necessary changes to make it to work for empty partition-path. The advantage is we use all the logic related  to nested fields handling in one place ? Let me know your thoughts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365045991
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Can you use this : https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/StringUtils.java#L67 ?
   And have the throws Exception in the else part that way you won't need an extra variable to track this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363554374
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   There is a NonPartitionedKeyGenerator that you can pass and use for deletions. Since as a client you are aware of 2 facts a) Either you are using GLOBAL_INDEX or not b) If this batch is for Deletions, you can configure the hudi client accordingly with a correct partition generator -> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363869991
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   @n3nash thanks for bringing that keyGenerator to my attention. That better serves my use case. Do you think extending it into a "GlobalDeleteKeyGenerator" would be a valid change instead?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365478024
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Interesting, I think we should fix the ComplexKeyGenerator as well, writing extra bytes like "__null__" just to denote null values doesn't seem like the best idea for data lakes. 
   Can you open a ticket for addressing that ? We can for now keep this implementation consistent with ComplexKeyGenerator. Once you open the ticket, please link it here and then I can merge this PR, thanks

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#issuecomment-574421754
 
 
   @n3nash Here is the JIRA: https://issues.apache.org/jira/browse/HUDI-536 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363554374
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   There is a NonPartitionedKeyGenerator that you can pass and use for deletions. Since as a client you are aware of 2 facts a) Either you are using GLOBAL_INDEX or not b) If this batch is for Deletions, you can configure the hudi client accordingly with a correct partition generator -> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java
   May be we can extend this class to reflect the correct key generation strategy for deletions with a more reasonable explicit name 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on issue #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#issuecomment-574314023
 
 
   @bschell Once you open up the JIRA for the backwards compatible package refactor, we can merge this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365459763
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   For more context, am trying to maintain record_key compatibility with
   https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/ComplexKeyGenerator.java 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on issue #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#issuecomment-573269632
 
 
   @n3nash Squashed the commits and updated PR. Also created the ticket here: https://issues.apache.org/jira/browse/HUDI-520 to decide on the approach for null/empty handling.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363554003
 
 

 ##########
 File path: hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
 ##########
 @@ -43,6 +44,8 @@
  */
 public abstract class HoodieIndex<T extends HoodieRecordPayload> implements Serializable {
 
+  public static EnumSet<IndexType> GLOBAL_INDICES = EnumSet.of(IndexType.GLOBAL_BLOOM, IndexType.HBASE);
 
 Review comment:
   This is not very desirable since this is only required for deletion purposes and does not add any other value. Each implementation of the Index already has a method that describes whether it's global or not. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365459763
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   @n3nash For more context, am trying to maintain record_key compatibility with
   https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/ComplexKeyGenerator.java 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r366506053
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
 
 Review comment:
   I think the point was to make the GlobalDeleteKeyGenerator work for any case, complex or simple. 
   @bvaradar 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash merged pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash merged pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365062845
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   Will change to this

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r366505359
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java
 ##########
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.keygen;
 
 Review comment:
   +1 @bschell 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363869991
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   @n3nash thanks for bringing that keyGenerator to my attention. That better serves my use case. Do you think extending it into a "GlobalDeleteKeyGenerator" would be a valid change instead? Or possibly creating an entirely new keyGenerator for global deletes instead, because the nonpartitionedKeyGenerator would not work if you are using the complex key generator.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363910472
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   I think just create an entirely new keyGenerator since tables with GlobalIndex can still be partitioned, all we are doing is here allowing us to pass just the record_keys to be able to delete those records.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
n3nash commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r363554374
 
 

 ##########
 File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
 ##########
 @@ -161,7 +162,17 @@ private[hudi] object HoodieSparkSqlWriter {
       // Convert to RDD[HoodieKey]
       val keyGenerator = DataSourceUtils.createKeyGenerator(toProperties(parameters))
       val genericRecords: RDD[GenericRecord] = AvroConversionUtils.createRdd(df, structName, nameSpace)
-      val hoodieKeysToDelete = genericRecords.map(gr => keyGenerator.getKey(gr)).toJavaRDD()
+      val hoodieKeysToDelete = if (HoodieIndex.GLOBAL_INDICES.contains(parameters(HoodieIndexConfig.INDEX_TYPE_PROP))) {
 
 Review comment:
   There is a NonPartitionedKeyGenerator that you can pass and use for deletions. Since as a client you are aware of 2 facts a) Either you are using GLOBAL_INDEX or not b) If this batch is for Deletions, you can configure the hudi client accordingly with a correct partition generator -> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/NonpartitionedKeyGenerator.java
   May be we can extend this class to reflect the correct key generation strategy for deletions to be explicit 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key

Posted by GitBox <gi...@apache.org>.
bschell commented on a change in pull request #1194: [HUDI-326] Add support to delete records with only record_key
URL: https://github.com/apache/incubator-hudi/pull/1194#discussion_r365490837
 
 

 ##########
 File path: hudi-spark/src/main/java/org/apache/hudi/keygen/GlobalDeleteKeyGenerator.java
 ##########
 @@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import java.util.Arrays;
+import java.util.List;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieKeyException;
+
+/**
+ * Key generator for deletes using global indices. Global index deletes do not require partition value
+ * so this key generator avoids using partition value for generating HoodieKey.
+ */
+public class GlobalDeleteKeyGenerator extends KeyGenerator {
+
+  private static final String EMPTY_PARTITION = "";
+  private static final String NULL_RECORDKEY_PLACEHOLDER = "__null__";
+  private static final String EMPTY_RECORDKEY_PLACEHOLDER = "__empty__";
+
+  protected final List<String> recordKeyFields;
+
+  public GlobalDeleteKeyGenerator(TypedProperties config) {
+    super(config);
+    this.recordKeyFields = Arrays.asList(config.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(","));
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+    if (recordKeyFields == null) {
+      throw new HoodieKeyException("Unable to find field names for record key or partition path in cfg");
+    }
+
+    boolean keyIsNullEmpty = true;
+    StringBuilder recordKey = new StringBuilder();
+    for (String recordKeyField : recordKeyFields) {
+      String recordKeyValue = DataSourceUtils.getNestedFieldValAsString(record, recordKeyField, true);
+      if (recordKeyValue == null) {
 
 Review comment:
   good point, will do.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services