Posted to commits@hudi.apache.org by "xushiyan (via GitHub)" <gi...@apache.org> on 2023/02/10 02:01:13 UTC

[GitHub] [hudi] xushiyan opened a new pull request, #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

xushiyan opened a new pull request, #7914:
URL: https://github.com/apache/hudi/pull/7914

   ### Change Logs
   
   Track persisted RDDs in `HoodieEngineContext` so that the tracked IDs can be used to select which RDDs to unpersist.
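
As a rough illustration of the idea in this change log, the sketch below tracks persisted ids per (base path, instant) and releases only the ids registered for one write. It is a simplified stand-in and not Hudi's code: `TrackingContextSketch`, `persist`, `releaseResources`, and `isPersisted` are hypothetical names, a plain set of integers stands in for Spark's registry of persistent RDDs, and thread safety is deliberately left out. Only `putCachedDataIds` and `getCachedDataIds` mirror method names that appear in this PR's diff and test.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative stand-in for an engine context; hypothetical, single-threaded. */
final class TrackingContextSketch {
  // (basePath, instantTime) -> ids of the RDDs persisted for that write.
  private final Map<Map.Entry<String, String>, List<Integer>> cachedIds = new HashMap<>();
  // Stand-in for the ids that the engine currently holds as persistent.
  private final Set<Integer> persistentRdds = new HashSet<>();

  /** Pretends to persist an RDD by remembering its id. */
  void persist(int rddId) {
    persistentRdds.add(rddId);
  }

  /** Registers ids under the given table path and instant. */
  void putCachedDataIds(String basePath, String instantTime, int... ids) {
    List<Integer> tracked =
        cachedIds.computeIfAbsent(key(basePath, instantTime), k -> new ArrayList<Integer>());
    for (int id : ids) {
      tracked.add(id);
    }
  }

  /** Returns a copy of the ids registered for the given table path and instant. */
  List<Integer> getCachedDataIds(String basePath, String instantTime) {
    List<Integer> tracked = cachedIds.get(key(basePath, instantTime));
    return tracked == null ? new ArrayList<Integer>() : new ArrayList<Integer>(tracked);
  }

  /** Hypothetical release step: unpersists only the ids registered for this write. */
  void releaseResources(String basePath, String instantTime) {
    List<Integer> ids = cachedIds.remove(key(basePath, instantTime));
    if (ids != null) {
      persistentRdds.removeAll(ids);
    }
  }

  boolean isPersisted(int rddId) {
    return persistentRdds.contains(rddId);
  }

  private static Map.Entry<String, String> key(String basePath, String instantTime) {
    return new SimpleImmutableEntry<String, String>(basePath, instantTime);
  }

  public static void main(String[] args) {
    TrackingContextSketch context = new TrackingContextSketch();
    context.persist(1);
    context.persist(2);
    context.putCachedDataIds("/tmp/table", "000", 1);
    context.putCachedDataIds("/tmp/table", "001", 2);
    context.releaseResources("/tmp/table", "001");
    System.out.println("id 1 persisted: " + context.isPersisted(1));
    System.out.println("id 2 persisted: " + context.isPersisted(2));
  }
}
```

Releasing by key leaves RDDs that belong to other instants, or that were persisted outside the tracked path, untouched, which is the behaviour the PR title describes.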
   
   ### Impact
   
   NA
   
   ### Risk level
   
   Low.
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436161075

   ## CI report:
   
   * 0acb530ba86094563364881c54530479588769a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   * 533c544d2325954818f7d96b7f91f6dc3748d61a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1442503094

   ## CI report:
   
   * ebff5c5a2399351d21de4eb80e3437749c6d1209 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289) 
   * 94ad2976cd88f284df074090b4da639d0a2eeeab UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426870103

   ## CI report:
   
   * 0acb530ba86094563364881c54530479588769a6 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1447290409

   ## CI report:
   
   * 94ad2976cd88f284df074090b4da639d0a2eeeab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370) 
   * 22e5065d585bacd3e4f13cd9c1db039a10d85d6b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15451) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426107317

   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 66780f5afc8835e99c9f0e81b5b9650003888447 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426411914

   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 66780f5afc8835e99c9f0e81b5b9650003888447 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090) 
   * b4471586094e6549c793b38276bab2b4907f2ab1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425582221

   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0d0da3dd9478911aca5e4c00148d26cc3e1e93f5 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081) 
   * 66780f5afc8835e99c9f0e81b5b9650003888447 UNKNOWN
   


[GitHub] [hudi] nsivabalan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106180368


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestSparkRDDWriteClient.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.storage.StorageLevel;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC;
+import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class TestSparkRDDWriteClient extends SparkClientFunctionalTestHarness {
+
+  static Stream<Arguments> testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds() {
+    return Stream.of(
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, true),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, true),
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, false),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds(HoodieTableType tableType, boolean shouldReleaseResource) throws IOException {
+    final HoodieTableMetaClient metaClient = getHoodieMetaClient(hadoopConf(), URI.create(basePath()).getPath(), tableType, new Properties());
+    final HoodieWriteConfig writeConfig = getConfigBuilder(true)
+        .withPath(metaClient.getBasePathV2().toString())
+        .withAutoCommit(false)
+        .withReleaseResourceEnabled(shouldReleaseResource)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
+        .build();
+    HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEED);
+
+    String instant0 = getCommitTimeAtUTC(0);
+    List<GenericRecord> extraRecords0 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd0 = jsc().parallelize(extraRecords0, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant0, persistedRdd0.id());
+
+    String instant1 = getCommitTimeAtUTC(1);
+    List<GenericRecord> extraRecords1 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd1 = jsc().parallelize(extraRecords1, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant1, persistedRdd1.id());
+
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(writeConfig);
+    List<HoodieRecord> records = dataGen.generateInserts(instant1, 10);
+    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 2);
+    writeClient.startCommitWithTime(instant1);
+    List<WriteStatus> writeStatuses = writeClient.insert(writeRecords, instant1).collect();
+    assertNoWriteErrors(writeStatuses);
+    writeClient.commitStats(instant1, writeStatuses.stream().map(WriteStatus::getStat).collect(Collectors.toList()),
+        Option.empty(), metaClient.getCommitActionType());
+    writeClient.close();
+
+    if (shouldReleaseResource) {
+      assertEquals(Collections.singletonList(persistedRdd0.id()),
+          context().getCachedDataIds(writeConfig.getBasePath(), instant0),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertEquals(Collections.emptyList(),

Review Comment:
   Minor: you can create two lists, `expectedToRetain` and `expectedToCleared`, and assert in a for loop over the entries in each list. That will reduce LOC.
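
One way to read this suggestion is sketched below in plain Java, without JUnit or Spark so that it stays self-contained. The class `RetainClearCheckSketch` and the method `firstMismatch` are hypothetical; `persistedIds` stands in for the key set of `jsc().getPersistentRDDs()` in the test under review.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper showing loop-based retain/clear checks. */
final class RetainClearCheckSketch {

  /**
   * Returns a message for the first violated expectation,
   * or null when every expectation holds.
   */
  static String firstMismatch(Set<Integer> persistedIds,
                              List<Integer> expectedToRetain,
                              List<Integer> expectedToCleared) {
    for (int id : expectedToRetain) {
      if (!persistedIds.contains(id)) {
        return "RDD " + id + " should be retained.";
      }
    }
    for (int id : expectedToCleared) {
      if (persistedIds.contains(id)) {
        return "RDD " + id + " should be cleared.";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Set<Integer> persistedIds = new HashSet<Integer>(Arrays.asList(7));
    String mismatch = firstMismatch(persistedIds, Arrays.asList(7), Arrays.asList(8, 9));
    System.out.println(mismatch == null ? "all expectations hold" : mismatch);
  }
}
```

In the JUnit test itself, the two loops would presumably call `assertTrue` and `assertFalse` on `containsKey(id)` directly instead of returning a message.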



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestSparkRDDWriteClient.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.storage.StorageLevel;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC;
+import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class TestSparkRDDWriteClient extends SparkClientFunctionalTestHarness {
+
+  static Stream<Arguments> testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds() {
+    return Stream.of(
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, true),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, true),
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, false),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds(HoodieTableType tableType, boolean shouldReleaseResource) throws IOException {
+    final HoodieTableMetaClient metaClient = getHoodieMetaClient(hadoopConf(), URI.create(basePath()).getPath(), tableType, new Properties());
+    final HoodieWriteConfig writeConfig = getConfigBuilder(true)
+        .withPath(metaClient.getBasePathV2().toString())
+        .withAutoCommit(false)
+        .withReleaseResourceEnabled(shouldReleaseResource)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
+        .build();
+    HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEED);
+
+    String instant0 = getCommitTimeAtUTC(0);
+    List<GenericRecord> extraRecords0 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd0 = jsc().parallelize(extraRecords0, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant0, persistedRdd0.id());
+
+    String instant1 = getCommitTimeAtUTC(1);
+    List<GenericRecord> extraRecords1 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd1 = jsc().parallelize(extraRecords1, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant1, persistedRdd1.id());
+
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(writeConfig);
+    List<HoodieRecord> records = dataGen.generateInserts(instant1, 10);
+    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 2);
+    writeClient.startCommitWithTime(instant1);
+    List<WriteStatus> writeStatuses = writeClient.insert(writeRecords, instant1).collect();
+    assertNoWriteErrors(writeStatuses);
+    writeClient.commitStats(instant1, writeStatuses.stream().map(WriteStatus::getStat).collect(Collectors.toList()),
+        Option.empty(), metaClient.getCommitActionType());
+    writeClient.close();
+
+    if (shouldReleaseResource) {
+      assertEquals(Collections.singletonList(persistedRdd0.id()),
+          context().getCachedDataIds(writeConfig.getBasePath(), instant0),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertEquals(Collections.emptyList(),
+          context().getCachedDataIds(writeConfig.getBasePath(), instant1),
+          "RDDs cached for " + instant1 + " should be cleared.");
+      assertTrue(jsc().getPersistentRDDs().containsKey(persistedRdd0.id()),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertFalse(jsc().getPersistentRDDs().containsKey(persistedRdd1.id()),
+          "RDDs cached for " + instant1 + " should be cleared.");
+      assertFalse(jsc().getPersistentRDDs().containsKey(writeRecords.id()),
+          "RDDs cached for " + instant1 + " should be cleared.");
+    } else {
+      assertEquals(Collections.singletonList(persistedRdd0.id()),
+          context().getCachedDataIds(writeConfig.getBasePath(), instant0),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertEquals(3,
+          context().getCachedDataIds(writeConfig.getBasePath(), instant1).size(),

Review Comment:
   Same here.





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106414946


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -58,6 +61,7 @@ public class HoodieSparkEngineContext extends HoodieEngineContext {
   private static final Logger LOG = LogManager.getLogger(HoodieSparkEngineContext.class);
   private final JavaSparkContext javaSparkContext;
   private final SQLContext sqlContext;
+  private final Map<Pair<String, String>, List<Integer>> cachedRddIds = new ConcurrentHashMap<>();

Review Comment:
   Let's add a comment explaining why the key is (basePath, instant).
   



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -246,6 +246,7 @@ protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieC
         .performClustering(clusteringPlan, schema, instantTime);
     HoodieData<WriteStatus> writeStatusList = writeMetadata.getWriteStatuses();
     HoodieData<WriteStatus> statuses = updateIndex(writeStatusList, writeMetadata);
+    context.putCachedDataIds(config.getBasePath(), instantTime, statuses.getId());

Review Comment:
   This is quite brittle -- it's far from obvious that we need to record cached RDD ids somewhere.
   I'd suggest we instead modify `HoodieData.persist` to accept the context and do this registration internally (so that we can establish it as an invariant that any persisted RDD will be registered).
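
A minimal sketch of the shape this suggestion points at, under stated assumptions: the types below are hypothetical stand-ins, not `HoodieData` or `HoodieEngineContext`, and the `persist` parameter list is invented for illustration. The point is only that persisting and registering happen in one call, so a caller cannot do one without the other.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical slice of a context that records persisted ids. */
interface CachedIdRegistrySketch {
  void putCachedDataIds(String basePath, String instantTime, int... ids);
}

/** Simple in-memory registry used only to make the sketch runnable. */
final class InMemoryRegistrySketch implements CachedIdRegistrySketch {
  private final Map<String, List<Integer>> ids = new HashMap<String, List<Integer>>();

  @Override
  public synchronized void putCachedDataIds(String basePath, String instantTime, int... newIds) {
    List<Integer> tracked =
        ids.computeIfAbsent(basePath + "@" + instantTime, k -> new ArrayList<Integer>());
    for (int id : newIds) {
      tracked.add(id);
    }
  }

  synchronized List<Integer> idsFor(String basePath, String instantTime) {
    List<Integer> tracked = ids.get(basePath + "@" + instantTime);
    return tracked == null ? new ArrayList<Integer>() : new ArrayList<Integer>(tracked);
  }
}

/** Hypothetical data handle whose persist call always registers its id. */
final class PersistableDataSketch {
  private final int id;
  private boolean persisted;

  PersistableDataSketch(int id) {
    this.id = id;
  }

  /** Persists and registers in one step, which makes registration an invariant. */
  void persist(CachedIdRegistrySketch context, String basePath, String instantTime) {
    persisted = true; // a real implementation would persist the underlying RDD here
    context.putCachedDataIds(basePath, instantTime, id);
  }

  boolean isPersisted() {
    return persisted;
  }

  public static void main(String[] args) {
    InMemoryRegistrySketch context = new InMemoryRegistrySketch();
    PersistableDataSketch statuses = new PersistableDataSketch(42);
    statuses.persist(context, "/tmp/table", "001");
    System.out.println(context.idsFor("/tmp/table", "001"));
  }
}
```

With this shape, call sites such as the one in the diff above would no longer need a separate `putCachedDataIds` line after persisting.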



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -180,4 +184,24 @@ public Option<String> getProperty(EngineProperty key) {
   public void setJobStatus(String activeModule, String activityDescription) {
     javaSparkContext.setJobGroup(activeModule, activityDescription);
   }
+
+  @Override
+  public void putCachedDataIds(String basePath, String instantTime, int... ids) {
+    Pair<String, String> key = Pair.of(basePath, instantTime);
+    cachedRddIds.putIfAbsent(key, new ArrayList<>());

Review Comment:
   Since we're appending to an `ArrayList` here, we need to guard it with a lock (and since we'd have to grab the lock anyway, we can just use a `HashMap`).
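
The concern here is that `ConcurrentHashMap.putIfAbsent` is atomic for the map entry, but a later `add` on the shared `ArrayList` value is not thread-safe. The sketch below shows the lock-guarded alternative with a plain `HashMap`; `LockedIdRegistrySketch` and its methods are hypothetical names, and a single string key stands in for the (basePath, instant) pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical registry where one lock guards both the map and its lists. */
final class LockedIdRegistrySketch {
  private final Object lock = new Object();
  private final Map<String, List<Integer>> idsByKey = new HashMap<String, List<Integer>>();

  void put(String key, int... ids) {
    synchronized (lock) {
      List<Integer> tracked = idsByKey.computeIfAbsent(key, k -> new ArrayList<Integer>());
      for (int id : ids) {
        tracked.add(id);
      }
    }
  }

  List<Integer> get(String key) {
    synchronized (lock) {
      List<Integer> tracked = idsByKey.get(key);
      // Return a copy so callers never touch the guarded list outside the lock.
      return tracked == null ? new ArrayList<Integer>() : new ArrayList<Integer>(tracked);
    }
  }

  /** Appends from several threads to one key and returns how many ids were recorded. */
  static int concurrentAppendCount(int threads, int perThread) {
    final LockedIdRegistrySketch registry = new LockedIdRegistrySketch();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      final int base = t * perThread;
      pool.execute(() -> {
        for (int i = 0; i < perThread; i++) {
          registry.put("basePath|instant", base + i);
        }
      });
    }
    pool.shutdown();
    try {
      if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
        throw new IllegalStateException("workers did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return registry.get("basePath|instant").size();
  }

  public static void main(String[] args) {
    System.out.println(concurrentAppendCount(8, 1000));
  }
}
```

Because every read and write goes through the same lock, no append is lost; the same loop against an unguarded `ArrayList` could drop elements or throw.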



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425594203

   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0d0da3dd9478911aca5e4c00148d26cc3e1e93f5 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081) 
   * 66780f5afc8835e99c9f0e81b5b9650003888447 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436298180

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 533c544d2325954818f7d96b7f91f6dc3748d61a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284) 
   * ebff5c5a2399351d21de4eb80e3437749c6d1209 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425086433

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc UNKNOWN
   


[GitHub] [hudi] nsivabalan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1119346610


##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/client/common/HoodieFlinkEngineContext.java:
##########
@@ -166,17 +167,17 @@ public void setJobStatus(String activeModule, String activityDescription) {
   }
 
   @Override
-  public void putCachedDataIds(String basePath, String instantTime, int... ids) {
+  public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) {

Review Comment:
   Maybe in a follow-up patch:
   why can't we support persisting and unpersisting for other engines too?
   



##########
hudi-common/src/main/java/org/apache/hudi/common/data/HoodieData.java:
##########
@@ -56,10 +58,19 @@
   int getId();
 
   /**
-   * Persists the data w/ provided {@code level} (if applicable)
+   * Persists the data w/ provided {@code level} (if applicable).
+   *
+   * Use this method only when you call {@link #unpersist()} at some later point for the same {@link HoodieData}.
+   * Otherwise, use {@link #persist(String, HoodieEngineContext, HoodieDataCacheKey)} instead for auto-unpersist
+   * at the end of a client write operation.
    */
   void persist(String level);
 
+  /**
+   * Persists the data w/ provided {@code level} (if applicable), and cache it within the {@code engineContext}.

Review Comment:
   Minor:
   "... and cache the RDD ids within the ..."





[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1442650574

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 94ad2976cd88f284df074090b4da639d0a2eeeab UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1442513046

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289",
       "triggerID" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "triggerType" : "PUSH"
     }, {
       "hash" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370",
       "triggerID" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ebff5c5a2399351d21de4eb80e3437749c6d1209 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289) 
   * 94ad2976cd88f284df074090b4da639d0a2eeeab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370) 
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1116333351


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java:
##########
@@ -83,7 +83,7 @@ public abstract void preCompact(
    *
    * @param writeStatus {@link HoodieData} of {@link WriteStatus}.
    */
-  public abstract void maybePersist(HoodieData<WriteStatus> writeStatus, HoodieWriteConfig config);
+  public abstract void maybePersist(HoodieData<WriteStatus> writeStatus, HoodieEngineContext context, HoodieWriteConfig config, String instantTime);

Review Comment:
   nit: Shall we place `context` as the first arg? (It's the convention.)



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/data/HoodieJavaRDD.java:
##########
@@ -81,11 +84,22 @@ public static <K, V> JavaPairRDD<K, V> getJavaRDD(HoodiePairData<K, V> hoodieDat
     return ((HoodieJavaPairRDD<K, V>) hoodieData).get();
   }
 
+  @Override
+  public int getId() {
+    return rddData.id();
+  }
+
   @Override
   public void persist(String level) {
     rddData.persist(StorageLevel.fromString(level));
   }
 
+  @Override
+  public void persist(String level, HoodieEngineContext engineContext, HoodieDataCacheKey cacheKey) {

Review Comment:
   Why do we have 2 overloads now (one accepting the context and one that doesn't)?



##########
hudi-common/src/main/java/org/apache/hudi/common/data/HoodieData.java:
##########
@@ -196,4 +212,42 @@ default <O> HoodieData<T> distinctWithKey(SerializableFunction<T, O> keyGetter,
         .reduceByKey((value1, value2) -> value1, parallelism)
         .values();
   }
+
+  /**
+   * The key used in a caching map to identify a {@link HoodieData}.
+   *
+   * At the end of a write operation, we manually unpersist the {@link HoodieData} associated with that writer.
+   * Therefore, in multi-writer scenario, we need to use both {@code basePath} and {@code instantTime} to identify {@link HoodieData}s.
+   */
+  class HoodieDataCacheKey implements Serializable {

Review Comment:
   We should avoid exposing this outside of the `HoodieData` class. No other component should know how we're caching it, so that it is easier for us to change if we need to.



##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/action/compact/HoodieFlinkMergeOnReadTableCompactor.java:
##########
@@ -55,7 +56,7 @@ public void preCompact(
   }
 
   @Override
-  public void maybePersist(HoodieData<WriteStatus> writeStatus, HoodieWriteConfig config) {
+  public void maybePersist(HoodieData<WriteStatus> writeStatus, HoodieEngineContext context, HoodieWriteConfig config, String instantTime) {

Review Comment:
   Same comment as above



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -180,4 +187,29 @@ public Option<String> getProperty(EngineProperty key) {
   public void setJobStatus(String activeModule, String activityDescription) {
     javaSparkContext.setJobGroup(activeModule, activityDescription);
   }
+
+  @Override
+  public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) {
+    synchronized (cacheLock) {

Review Comment:
   No need for a separate lock; we can synchronize on the cache itself.
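
   A minimal sketch of that suggestion, assuming the cache is a private `HashMap` field that is never handed out to callers. The class `CachedIdRegistry`, its `String` key, and its method names are hypothetical and are not taken from the PR:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the engine context's bookkeeping of cached data ids.
class CachedIdRegistry {
  // The map itself serves as the monitor, so no dedicated lock object is needed.
  private final Map<String, List<Integer>> cache = new HashMap<>();

  void put(String key, int... ids) {
    synchronized (cache) {
      List<Integer> list = cache.computeIfAbsent(key, k -> new ArrayList<>());
      for (int id : ids) {
        list.add(id);
      }
    }
  }

  List<Integer> get(String key) {
    synchronized (cache) {
      List<Integer> list = cache.get(key);
      if (list == null) {
        return Collections.<Integer>emptyList();
      }
      // Return a copy so callers cannot mutate the list outside the lock.
      return new ArrayList<>(list);
    }
  }

  List<Integer> remove(String key) {
    synchronized (cache) {
      List<Integer> removed = cache.remove(key);
      if (removed == null) {
        return Collections.<Integer>emptyList();
      }
      return removed;
    }
  }
}
```

   Synchronizing on the map is only safe as long as the map reference stays private to the class; if it were exposed, outside code could lock on it too.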



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -180,4 +187,29 @@ public Option<String> getProperty(EngineProperty key) {
   public void setJobStatus(String activeModule, String activityDescription) {
     javaSparkContext.setJobGroup(activeModule, activityDescription);
   }
+
+  @Override
+  public void putCachedDataIds(HoodieDataCacheKey cacheKey, int... ids) {
+    synchronized (cacheLock) {

Review Comment:
   Let's also annotate this class as `@ThreadSafe`.



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestSparkRDDWriteClient.java:
##########
@@ -0,0 +1,124 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.data.HoodieData.HoodieDataCacheKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.data.HoodieJavaRDD;
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC;
+import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class TestSparkRDDWriteClient extends SparkClientFunctionalTestHarness {
+
+  static Stream<Arguments> testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds() {
+    return Stream.of(
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, true),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, true),
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, false),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds(HoodieTableType tableType, boolean shouldReleaseResource) throws IOException {
+    final HoodieTableMetaClient metaClient = getHoodieMetaClient(hadoopConf(), URI.create(basePath()).getPath(), tableType, new Properties());
+    final HoodieWriteConfig writeConfig = getConfigBuilder(true)
+        .withPath(metaClient.getBasePathV2().toString())
+        .withAutoCommit(false)
+        .withReleaseResourceEnabled(shouldReleaseResource)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
+        .build();
+    HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEED);
+
+    String instant0 = getCommitTimeAtUTC(0);
+    List<GenericRecord> extraRecords0 = dataGen.generateGenericRecords(10);
+    HoodieJavaRDD<GenericRecord> persistedRdd0 = HoodieJavaRDD.of(jsc().parallelize(extraRecords0, 2));
+    persistedRdd0.persist("MEMORY_AND_DISK", context(), HoodieDataCacheKey.of(writeConfig.getBasePath(), instant0));
+
+    String instant1 = getCommitTimeAtUTC(1);
+    List<GenericRecord> extraRecords1 = dataGen.generateGenericRecords(10);
+    HoodieJavaRDD<GenericRecord> persistedRdd1 = HoodieJavaRDD.of(jsc().parallelize(extraRecords1, 2));
+    persistedRdd1.persist("MEMORY_AND_DISK", context(), HoodieDataCacheKey.of(writeConfig.getBasePath(), instant1));
+
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(writeConfig);
+    List<HoodieRecord> records = dataGen.generateInserts(instant1, 10);
+    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 2);
+    writeClient.startCommitWithTime(instant1);
+    List<WriteStatus> writeStatuses = writeClient.insert(writeRecords, instant1).collect();
+    assertNoWriteErrors(writeStatuses);
+    writeClient.commitStats(instant1, writeStatuses.stream().map(WriteStatus::getStat).collect(Collectors.toList()),
+        Option.empty(), metaClient.getCommitActionType());
+    writeClient.close();
+
+    if (shouldReleaseResource) {
+      assertEquals(Collections.singletonList(persistedRdd0.getId()),
+          context().getCachedDataIds(HoodieDataCacheKey.of(writeConfig.getBasePath(), instant0)),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertEquals(Collections.emptyList(),
+          context().getCachedDataIds(HoodieDataCacheKey.of(writeConfig.getBasePath(), instant1)),
+          "RDDs cached for " + instant1 + " should be cleared.");
+      assertTrue(jsc().getPersistentRDDs().containsKey(persistedRdd0.getId()),

Review Comment:
   Should we combine these 3 assertions into 1 that asserts on the ids of all persisted RDDs?





[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1108085302


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -180,4 +184,24 @@ public Option<String> getProperty(EngineProperty key) {
   public void setJobStatus(String activeModule, String activityDescription) {
     javaSparkContext.setJobGroup(activeModule, activityDescription);
   }
+
+  @Override
+  public void putCachedDataIds(String basePath, String instantTime, int... ids) {
+    Pair<String, String> key = Pair.of(basePath, instantTime);
+    cachedRddIds.putIfAbsent(key, new ArrayList<>());

Review Comment:
   yup good catch





[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425090547

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073) 
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1119397979


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/data/HoodieJavaRDD.java:
##########
@@ -81,11 +84,22 @@ public static <K, V> JavaPairRDD<K, V> getJavaRDD(HoodiePairData<K, V> hoodieDat
     return ((HoodieJavaPairRDD<K, V>) hoodieData).get();
   }
 
+  @Override
+  public int getId() {
+    return rddData.id();
+  }
+
   @Override
   public void persist(String level) {
     rddData.persist(StorageLevel.fromString(level));
   }
 
+  @Override
+  public void persist(String level, HoodieEngineContext engineContext, HoodieDataCacheKey cacheKey) {

Review Comment:
   The plain `persist()` is used where `unpersist()` is invoked manually right afterwards; see e.g. `org.apache.hudi.index.bloom.HoodieBloomIndex#tagLocation`. I added Javadoc to the interface API to explain the usage.
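
   To illustrate the two usage patterns, here is a simplified sketch, not the PR's actual implementation. `FakeEngine`, its integer ids, its `String` keys, and `releaseAll` are hypothetical names standing in for the engine context, RDD ids, the cache key, and the end-of-write cleanup:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an engine that tracks which data ids are persisted.
class FakeEngine {
  final Set<Integer> persisted = new HashSet<>();
  // Ids registered for automatic release, grouped by a writer-specific key.
  private final Map<String, List<Integer>> tracked = new HashMap<>();

  // Pattern 1: plain persist; the caller is responsible for calling unpersist(id).
  void persist(int id) {
    persisted.add(id);
  }

  // Pattern 2: persist and register the id under a key for release at the end of the write.
  void persist(int id, String key) {
    persisted.add(id);
    tracked.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
  }

  void unpersist(int id) {
    persisted.remove(id);
  }

  // Releases only the ids registered under the given key; everything else stays persisted.
  void releaseAll(String key) {
    List<Integer> ids = tracked.remove(key);
    if (ids != null) {
      persisted.removeAll(ids);
    }
  }
}
```

   With this shape, data persisted under one key (for example, one base path and instant) is left alone when another key is released, which is the behavior the PR title describes.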





[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1442656338

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370",
       "triggerID" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 94ad2976cd88f284df074090b4da639d0a2eeeab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426620442

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095",
       "triggerID" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * b4471586094e6549c793b38276bab2b4907f2ab1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095) 
   * 0acb530ba86094563364881c54530479588769a6 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425182361

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073) 
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436281032

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 533c544d2325954818f7d96b7f91f6dc3748d61a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436326211

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289",
       "triggerID" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 533c544d2325954818f7d96b7f91f6dc3748d61a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284) 
   * ebff5c5a2399351d21de4eb80e3437749c6d1209 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289) 
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106177142


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java:
##########
@@ -685,51 +684,4 @@ public void testHandleUpdateWithMultiplePartitions() throws Exception {
       assertEquals(fewRecordsForDelete.size() - numRecordsInPartition, status.getTotalErrorRecords());
     }
   }
-
-  @Test
-  public void testReleaseResource() throws Exception {

Review Comment:
   This test case is covered by the newly added one.





[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106832148


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -246,6 +246,7 @@ protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieC
         .performClustering(clusteringPlan, schema, instantTime);
     HoodieData<WriteStatus> writeStatusList = writeMetadata.getWriteStatuses();
     HoodieData<WriteStatus> statuses = updateIndex(writeStatusList, writeMetadata);
+    context.putCachedDataIds(config.getBasePath(), instantTime, statuses.getId());

Review Comment:
   I wasn't happy with tracing every persist call and did think about this approach, but I also wanted to keep the scope of impact narrow. Changing every persist() call may lead to unexpected side effects.
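
To make the trade-off concrete, here is a minimal sketch of the tracking idea: persisted data IDs are recorded per base path and instant, so a later cleanup can look up only the IDs that belong to one commit. The class name `CachedDataRegistry`, the `removeCachedDataIds` method, the varargs signature, and the `basePath@instantTime` key format are assumptions for illustration; only the `putCachedDataIds` and `getCachedDataIds` names come from the diff, and this is not the actual `HoodieEngineContext` code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the ID-tracking part of an engine context. */
class CachedDataRegistry {
  // Key: basePath + "@" + instantTime (assumed format); value: IDs of persisted data.
  private final Map<String, List<Integer>> cachedIds = new ConcurrentHashMap<>();

  private static String key(String basePath, String instantTime) {
    return basePath + "@" + instantTime;
  }

  /** Records one or more persisted data IDs for the given table and instant. */
  void putCachedDataIds(String basePath, String instantTime, int... ids) {
    cachedIds.compute(key(basePath, instantTime), (k, existing) -> {
      List<Integer> merged = existing == null ? new ArrayList<>() : new ArrayList<>(existing);
      for (int id : ids) {
        merged.add(id);
      }
      return merged;
    });
  }

  /** Returns a read-only view of the IDs tracked for the instant, or an empty list. */
  List<Integer> getCachedDataIds(String basePath, String instantTime) {
    return Collections.unmodifiableList(
        cachedIds.getOrDefault(key(basePath, instantTime), Collections.<Integer>emptyList()));
  }

  /** Stops tracking the instant and returns the IDs a caller would then unpersist. */
  List<Integer> removeCachedDataIds(String basePath, String instantTime) {
    List<Integer> removed = cachedIds.remove(key(basePath, instantTime));
    return removed == null ? Collections.<Integer>emptyList() : removed;
  }
}
```

With a registry like this, releasing resources for one instant touches only the IDs registered under that instant, and entries for other instants or other tables stay in place. Registering IDs at a few explicit call sites, as in the diff above, keeps the change narrow at the cost of missing any persist call that is not registered.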





[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106835666


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestSparkRDDWriteClient.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.SparkClientFunctionalTestHarness;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.storage.StorageLevel;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+import java.io.IOException;
+import java.net.URI;
+import java.util.Collections;
+import java.util.List;
+import java.util.Properties;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+import static org.apache.hudi.common.testutils.HoodieTestDataGenerator.getCommitTimeAtUTC;
+import static org.apache.hudi.testutils.Assertions.assertNoWriteErrors;
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertFalse;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+class TestSparkRDDWriteClient extends SparkClientFunctionalTestHarness {
+
+  static Stream<Arguments> testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds() {
+    return Stream.of(
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, true),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, true),
+        Arguments.of(HoodieTableType.COPY_ON_WRITE, false),
+        Arguments.of(HoodieTableType.MERGE_ON_READ, false)
+    );
+  }
+
+  @ParameterizedTest
+  @MethodSource
+  void testWriteClientReleaseResourcesShouldOnlyUnpersistRelevantRdds(HoodieTableType tableType, boolean shouldReleaseResource) throws IOException {
+    final HoodieTableMetaClient metaClient = getHoodieMetaClient(hadoopConf(), URI.create(basePath()).getPath(), tableType, new Properties());
+    final HoodieWriteConfig writeConfig = getConfigBuilder(true)
+        .withPath(metaClient.getBasePathV2().toString())
+        .withAutoCommit(false)
+        .withReleaseResourceEnabled(shouldReleaseResource)
+        .withMetadataConfig(HoodieMetadataConfig.newBuilder().enable(false).build())
+        .build();
+    HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator(0xDEED);
+
+    String instant0 = getCommitTimeAtUTC(0);
+    List<GenericRecord> extraRecords0 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd0 = jsc().parallelize(extraRecords0, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant0, persistedRdd0.id());
+
+    String instant1 = getCommitTimeAtUTC(1);
+    List<GenericRecord> extraRecords1 = dataGen.generateGenericRecords(10);
+    JavaRDD persistedRdd1 = jsc().parallelize(extraRecords1, 2).persist(StorageLevel.MEMORY_AND_DISK());
+    context().putCachedDataIds(writeConfig.getBasePath(), instant1, persistedRdd1.id());
+
+    SparkRDDWriteClient writeClient = getHoodieWriteClient(writeConfig);
+    List<HoodieRecord> records = dataGen.generateInserts(instant1, 10);
+    JavaRDD<HoodieRecord> writeRecords = jsc().parallelize(records, 2);
+    writeClient.startCommitWithTime(instant1);
+    List<WriteStatus> writeStatuses = writeClient.insert(writeRecords, instant1).collect();
+    assertNoWriteErrors(writeStatuses);
+    writeClient.commitStats(instant1, writeStatuses.stream().map(WriteStatus::getStat).collect(Collectors.toList()),
+        Option.empty(), metaClient.getCommitActionType());
+    writeClient.close();
+
+    if (shouldReleaseResource) {
+      assertEquals(Collections.singletonList(persistedRdd0.id()),
+          context().getCachedDataIds(writeConfig.getBasePath(), instant0),
+          "RDDs cached for " + instant0 + " should be retained.");
+      assertEquals(Collections.emptyList(),

Review Comment:
   I'm not quite sure about the suggested style change. I usually prefer straightforward assertions over conditions or loops, for readability.





[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436157885

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0acb530ba86094563364881c54530479588769a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   * 533c544d2325954818f7d96b7f91f6dc3748d61a UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1436445506

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     }, {
       "hash" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15284",
       "triggerID" : "533c544d2325954818f7d96b7f91f6dc3748d61a",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289",
       "triggerID" : "ebff5c5a2399351d21de4eb80e3437749c6d1209",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ebff5c5a2399351d21de4eb80e3437749c6d1209 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15289) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426574904

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095",
       "triggerID" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * b4471586094e6549c793b38276bab2b4907f2ab1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425573262

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073) 
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0d0da3dd9478911aca5e4c00148d26cc3e1e93f5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081) 
   * 66780f5afc8835e99c9f0e81b5b9650003888447 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425187692

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073) 
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0d0da3dd9478911aca5e4c00148d26cc3e1e93f5 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426632450

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095",
       "triggerID" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * b4471586094e6549c793b38276bab2b4907f2ab1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095) 
   * 0acb530ba86094563364881c54530479588769a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1447284129

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370",
       "triggerID" : "94ad2976cd88f284df074090b4da639d0a2eeeab",
       "triggerType" : "PUSH"
     }, {
       "hash" : "22e5065d585bacd3e4f13cd9c1db039a10d85d6b",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "22e5065d585bacd3e4f13cd9c1db039a10d85d6b",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 94ad2976cd88f284df074090b4da639d0a2eeeab Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15370) 
   * 22e5065d585bacd3e4f13cd9c1db039a10d85d6b UNKNOWN
   


[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1111352056


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -246,6 +246,7 @@ protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieC
         .performClustering(clusteringPlan, schema, instantTime);
     HoodieData<WriteStatus> writeStatusList = writeMetadata.getWriteStatuses();
     HoodieData<WriteStatus> statuses = updateIndex(writeStatusList, writeMetadata);
+    context.putCachedDataIds(config.getBasePath(), instantTime, statuses.getId());

Review Comment:
   Fixed.





[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1425192841

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "CANCELED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 15f5bae787294c0509a8e7b849132f08080c59cc Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073) 
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0d0da3dd9478911aca5e4c00148d26cc3e1e93f5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426661291

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15095",
       "triggerID" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 0acb530ba86094563364881c54530479588769a6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426871157

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0acb530ba86094563364881c54530479588769a6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   


[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426888196

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "0acb530ba86094563364881c54530479588769a6",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106",
       "triggerID" : "0acb530ba86094563364881c54530479588769a6",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0acb530ba86094563364881c54530479588769a6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15106) 
   




[GitHub] [hudi] hudi-bot commented on pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #7914:
URL: https://github.com/apache/hudi/pull/7914#issuecomment-1426362231

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15073",
       "triggerID" : "15f5bae787294c0509a8e7b849132f08080c59cc",
       "triggerType" : "PUSH"
     }, {
       "hash" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "319132092c5c1521ff11a2100bd325e5a280459f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15081",
       "triggerID" : "0d0da3dd9478911aca5e4c00148d26cc3e1e93f5",
       "triggerType" : "PUSH"
     }, {
       "hash" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090",
       "triggerID" : "66780f5afc8835e99c9f0e81b5b9650003888447",
       "triggerType" : "PUSH"
     }, {
       "hash" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "b4471586094e6549c793b38276bab2b4907f2ab1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 319132092c5c1521ff11a2100bd325e5a280459f UNKNOWN
   * 66780f5afc8835e99c9f0e81b5b9650003888447 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=15090) 
   * b4471586094e6549c793b38276bab2b4907f2ab1 UNKNOWN
   




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "alexeykudinkin (via GitHub)" <gi...@apache.org>.
alexeykudinkin commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1107408876


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -246,6 +246,7 @@ protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieC
         .performClustering(clusteringPlan, schema, instantTime);
     HoodieData<WriteStatus> writeStatusList = writeMetadata.getWriteStatuses();
     HoodieData<WriteStatus> statuses = updateIndex(writeStatusList, writeMetadata);
+    context.putCachedDataIds(config.getBasePath(), instantTime, statuses.getId());

Review Comment:
   `HoodieData` is already tightly coupled (1:1) with `HoodieEngineContext`, so there is nothing wrong with a `HoodieData` API accepting a `HoodieEngineContext`.
   
   The current approach is extremely brittle: we can't expect every caller to know that they must register the RDD whenever they persist it.
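
A minimal sketch of the alternative suggested here, in which persisting registers the id with the owning context so that no caller has to remember a separate bookkeeping step. `TrackingContext` and `TrackedData` are hypothetical stand-ins, not Hudi's `HoodieEngineContext` or `HoodieData`, and no real caching happens:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical context that records the ids of data persisted through it. */
final class TrackingContext {
  private final List<Integer> persistedIds = new ArrayList<>();

  synchronized void registerPersisted(int id) {
    persistedIds.add(id);
  }

  /** Returns a copy so callers cannot mutate the internal list. */
  synchronized List<Integer> persistedIds() {
    return new ArrayList<>(persistedIds);
  }
}

/** Hypothetical data handle bound 1:1 to the context that created it. */
final class TrackedData {
  private final int id;
  private final TrackingContext context;
  private boolean persisted;

  TrackedData(int id, TrackingContext context) {
    this.id = id;
    this.context = context;
  }

  /** Persisting and registering happen together; repeated calls register once. */
  TrackedData persist() {
    if (!persisted) {
      persisted = true; // a real implementation would cache the underlying data here
      context.registerPersisted(id);
    }
    return this;
  }
}
```

With this shape, data that is never persisted is never registered, and data persisted twice is registered once.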
   





[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1111352227


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java:
##########
@@ -58,6 +61,7 @@ public class HoodieSparkEngineContext extends HoodieEngineContext {
   private static final Logger LOG = LogManager.getLogger(HoodieSparkEngineContext.class);
   private final JavaSparkContext javaSparkContext;
   private final SQLContext sqlContext;
+  private final Map<Pair<String, String>, List<Integer>> cachedRddIds = new ConcurrentHashMap<>();

Review Comment:
   Clarified in the `HoodieDataCacheKey` Javadoc.





[GitHub] [hudi] xushiyan commented on a diff in pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan commented on code in PR #7914:
URL: https://github.com/apache/hudi/pull/7914#discussion_r1106832148


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##########
@@ -246,6 +246,7 @@ protected HoodieWriteMetadata<HoodieData<WriteStatus>> executeClustering(HoodieC
         .performClustering(clusteringPlan, schema, instantTime);
     HoodieData<WriteStatus> writeStatusList = writeMetadata.getWriteStatuses();
     HoodieData<WriteStatus> statuses = updateIndex(writeStatusList, writeMetadata);
+    context.putCachedDataIds(config.getBasePath(), instantTime, statuses.getId());

Review Comment:
   I wasn't happy with tracking every persist call either, and I considered this approach, but I also wanted to keep the scope of impact narrow: changing every persist() call may lead to unexpected side effects. It also looks a bit odd for a HoodieData to know about a HoodieEngineContext. Having HoodieEngineContext track every HoodieData from creation and automatically cache its id makes more sense, but that is a much bigger change than this PR intends.
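
For illustration, the explicit-registration approach in the diff above could look roughly like the sketch below: ids are recorded under a (basePath, instantTime) key, and only the ids for one write are taken back out when it finishes. `CachedDataRegistry` and `removeCachedDataIds` are hypothetical names. Only the `putCachedDataIds(basePath, instantTime, id)` call shape and the `ConcurrentHashMap` field come from the diff, and the real implementation may differ:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical registry of persisted data ids; not Hudi's actual class. */
final class CachedDataRegistry {
  // Key is (basePath, instantTime); value holds the ids persisted for that write.
  private final Map<Map.Entry<String, String>, List<Integer>> cachedIds =
      new ConcurrentHashMap<>();

  /** Records one or more ids under the given table path and instant. */
  void putCachedDataIds(String basePath, String instantTime, int... ids) {
    List<Integer> list = cachedIds.computeIfAbsent(
        new SimpleImmutableEntry<>(basePath, instantTime),
        key -> new CopyOnWriteArrayList<>());
    for (int id : ids) {
      list.add(id);
    }
  }

  /** Removes and returns the ids for one write; other writes are untouched. */
  List<Integer> removeCachedDataIds(String basePath, String instantTime) {
    List<Integer> removed =
        cachedIds.remove(new SimpleImmutableEntry<>(basePath, instantTime));
    return removed == null
        ? Collections.<Integer>emptyList()
        : new ArrayList<>(removed);
  }
}
```

A caller would then unpersist only the ids returned for its own (basePath, instantTime), leaving data cached by other writers alone.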





[GitHub] [hudi] xushiyan merged pull request #7914: [HUDI-5080] Unpersist only relevant RDDs instead of all

Posted by "xushiyan (via GitHub)" <gi...@apache.org>.
xushiyan merged PR #7914:
URL: https://github.com/apache/hudi/pull/7914

