Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/15 00:59:15 UTC

[GitHub] [hudi] yihua opened a new pull request, #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

yihua opened a new pull request, #6113:
URL: https://github.com/apache/hudi/pull/6113

   ## What is the purpose of the pull request
   
   This PR fixes bloom filters missing from the metadata table for non-partitioned tables, caused by incorrect record key generation.  Before this PR, the wrong file name was used when generating the metadata payload for the bloom filter.  For example, the following shows the file names used to construct the metadata payload:
   ```
   Filename: 03656eb-c000-474b-945e-aa9298c3334d_1-0-1_0000001.parquet Bloom filter record key: DW/eaNVbRdo=xDmB/pnnQIMnCbUZywNZxw==
   Filename: f1a759f-8e00-4cc4-8af0-676d3c892657_1-0-1_0000002.parquet Bloom filter record key: DW/eaNVbRdo=t/6nT2vbZbsGoSkZBCOKZA==
   Filename: ca4aa60-2659-4fae-9d57-c4f51e8a7343_1-0-1_0000003.parquet Bloom filter record key: DW/eaNVbRdo=DsnvarlysKz9lJxfoZ81iA==
   ```
   Each file name is missing its first character.  In Bloom Index, when a lookup in the metadata table is done with the actual file name, the corresponding bloom filter cannot be found because the record key generated during the lookup does not match the one stored in the metadata table, causing the upsert to fail:
   ```
   BaseTableMetadata: BloomFilterIndex pair:  0f1a759f-8e00-4cc4-8af0-676d3c892657_1-0-1_0000002.parquet
   BaseTableMetadata: BloomFilterIndex pair:  eca4aa60-2659-4fae-9d57-c4f51e8a7343_1-0-1_0000003.parquet
   ```
   ```
   Caused by: org.apache.hudi.exception.HoodieIndexException: Failed to get the bloom filter for (,0f1a759f-8e00-4cc4-8af0-676d3c892657_1-0-1_0000002.parquet)
   	at org.apache.hudi.index.bloom.HoodieMetadataBloomIndexCheckFunction$BloomIndexLazyKeyCheckIterator.lambda$computeNext$2(HoodieMetadataBloomIndexCheckFunction.java:127)
   	at java.util.HashMap.forEach(HashMap.java:1289)
   	at org.apache.hudi.index.bloom.HoodieMetadataBloomIndexCheckFunction$BloomIndexLazyKeyCheckIterator.computeNext(HoodieMetadataBloomIndexCheckFunction.java:120)
   	at org.apache.hudi.index.bloom.HoodieMetadataBloomIndexCheckFunction$BloomIndexLazyKeyCheckIterator.computeNext(HoodieMetadataBloomIndexCheckFunction.java:76)
   	at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
   	... 15 more
   ```
   The fix is to generate the correct file name for the non-partitioned table.
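
A minimal sketch of the off-by-one described above, assuming the `partition` value of a non-partitioned table arrives as an empty string (as the code comment added in the diff states). The class and method names are hypothetical, the `NON_PARTITIONED_NAME` value is a placeholder for illustration, and `fileNameFixed` is an assumed shape of the correction, not the merged code:

```java
class FileNameOffsetSketch {
    // Placeholder value for illustration only; the real constant is defined in Hudi.
    static final String NON_PARTITIONED_NAME = ".";

    // Offset computation as it appears on the removed lines of the diff.
    static String fileNameOld(String partition, String pathWithPartition) {
        int offset = partition.equals(NON_PARTITIONED_NAME)
            ? (pathWithPartition.startsWith("/") ? 1 : 0)
            : partition.length() + 1;
        return pathWithPartition.substring(offset);
    }

    // Assumed correction: a null or empty partition means "non-partitioned",
    // so no partition prefix (and no separator) is skipped.
    static String fileNameFixed(String partition, String pathWithPartition) {
        boolean nonPartitioned = partition == null || partition.isEmpty();
        int offset = nonPartitioned
            ? (pathWithPartition.startsWith("/") ? 1 : 0)
            : partition.length() + 1;
        return pathWithPartition.substring(offset);
    }

    public static void main(String[] args) {
        String file = "0f1a759f-8e00-4cc4-8af0-676d3c892657_1-0-1_0000002.parquet";
        // Empty partition: the old logic skips 0 + 1 characters and drops the leading "0".
        System.out.println(fileNameOld("", file));
        // The assumed fix keeps the full file name.
        System.out.println(fileNameFixed("", file));
        // A partitioned path is handled the same way by both versions.
        System.out.println(fileNameFixed("2022/07/15", "2022/07/15/" + file));
    }
}
```

With an empty partition, the old computation yields an offset of 1, which is consistent with the truncated file names in the log excerpt above.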
   
   ## Brief change log
   
     - Fixes the file name generation logic for non-partitioned tables in `HoodieTableMetadataUtil`
     - Adds unit tests for Bloom Index using the metadata table, for both partitioned and non-partitioned tables
     - Fixes commit metadata generation for non-partitioned tables
   
   ## Verify this pull request
   
   This PR adds unit tests for Bloom Index using the metadata table so that all existing tests run in two setups, with and without the metadata table for column stats and bloom filters.  This PR also adds tests for non-partitioned tables.  Before the fix, the tests for non-partitioned tables fail.  After the fix, the same set of tests succeeds.  The fix is verified to resolve the problem for upserts on S3 using Bloom Index with metadata table reads.
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r922499324


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -110,24 +116,35 @@ public void tearDown() throws Exception {
     cleanupResources();
   }
 
-  private HoodieWriteConfig makeConfig(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) {
+  private HoodieWriteConfig makeConfig(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking, boolean useMetadataTable) {
+    // For the bloom index to use column stats and bloom filters from metadata table,
+    // the following configs must be set to true:
+    // "hoodie.bloom.index.use.metadata"
+    // "hoodie.metadata.enable" (by default is true)
+    // "hoodie.metadata.index.column.stats.enable"
+    // "hoodie.metadata.index.bloom.filter.enable"
     return HoodieWriteConfig.newBuilder().withPath(basePath)
         .withIndexConfig(HoodieIndexConfig.newBuilder().bloomIndexPruneByRanges(rangePruning)
             .bloomIndexTreebasedFilter(treeFiltering).bloomIndexBucketizedChecking(bucketizedChecking)
-            .bloomIndexKeysPerBucket(2).build())
+            .bloomIndexKeysPerBucket(2).bloomIndexUseMetadata(useMetadataTable).build())

Review Comment:
   Can we please put each of these on its own line so that it's easier to see which config attributes are set where?
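
As a rough illustration of the settings listed in the code comment of the diff above, the four keys can be collected one per line in a plain `java.util.Properties` object. The class and method names here are hypothetical, and how the properties are handed to a Hudi writer is outside this sketch:

```java
import java.util.Properties;
import java.util.TreeSet;

class BloomIndexMetadataProps {
    // Collects the four keys named in the diff comment, each set to "true".
    static Properties metadataBackedBloomIndexProps() {
        Properties props = new Properties();
        props.setProperty("hoodie.bloom.index.use.metadata", "true");
        // Per the diff comment, this one is already true by default.
        props.setProperty("hoodie.metadata.enable", "true");
        props.setProperty("hoodie.metadata.index.column.stats.enable", "true");
        props.setProperty("hoodie.metadata.index.bloom.filter.enable", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = metadataBackedBloomIndexProps();
        // Print the keys in sorted order, one setting per line.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            System.out.println(key + "=" + props.getProperty(key));
        }
    }
}
```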



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -428,7 +455,99 @@ public void testTagLocation(boolean rangePruning, boolean treeFiltering, boolean
 
   @ParameterizedTest(name = TEST_NAME_WITH_PARAMS)
   @MethodSource("configParams")
-  public void testCheckExists(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) throws Exception {
+  public void testTagLocationOnNonpartitionedTable(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking,
+      boolean useMetadataTable) throws Exception {
+    // We have some records to be tagged (two different partitions)
+    String rowKey1 = UUID.randomUUID().toString();

Review Comment:
   Let's generate the UUIDs from a `Random` with a fixed seed (so that they don't change from run to run); there's `genPseudoRandomUUID` specifically for that
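
A minimal sketch of the idea, not the actual `genPseudoRandomUUID` helper (whose signature is not shown in this thread). The class name, method name, and seed value are assumptions for illustration:

```java
import java.util.Random;
import java.util.UUID;

class SeededUuidSketch {
    // Builds a UUID from two longs drawn from the given generator.
    // Unlike UUID.randomUUID(), the version and variant bits are not set,
    // which is usually acceptable for test fixtures.
    static UUID nextPseudoRandomUuid(Random random) {
        return new UUID(random.nextLong(), random.nextLong());
    }

    public static void main(String[] args) {
        // Two generators with the same (arbitrary) seed yield the same sequence.
        Random first = new Random(42L);
        Random second = new Random(42L);
        UUID a = nextPseudoRandomUuid(first);
        UUID b = nextPseudoRandomUuid(second);
        System.out.println(a.equals(b));
    }
}
```

Because the seed is fixed, the row keys and file IDs are identical on every run, which makes a failing test reproducible.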



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -428,7 +455,99 @@ public void testTagLocation(boolean rangePruning, boolean treeFiltering, boolean
 
   @ParameterizedTest(name = TEST_NAME_WITH_PARAMS)
   @MethodSource("configParams")
-  public void testCheckExists(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) throws Exception {
+  public void testTagLocationOnNonpartitionedTable(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking,
+      boolean useMetadataTable) throws Exception {
+    // We have some records to be tagged (two different partitions)
+    String rowKey1 = UUID.randomUUID().toString();
+    String rowKey2 = UUID.randomUUID().toString();
+    String rowKey3 = UUID.randomUUID().toString();
+    String recordStr1 = "{\"_row_key\":\"" + rowKey1 + "\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+    String recordStr2 = "{\"_row_key\":\"" + rowKey2 + "\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+    String recordStr3 = "{\"_row_key\":\"" + rowKey3 + "\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+
+    String emptyPartitionPath = "";
+    RawTripTestPayload rowChange1 = new RawTripTestPayload(recordStr1);
+    HoodieRecord record1 =
+        new HoodieAvroRecord(new HoodieKey(rowChange1.getRowKey(), emptyPartitionPath), rowChange1);
+    RawTripTestPayload rowChange2 = new RawTripTestPayload(recordStr2);
+    HoodieRecord record2 =
+        new HoodieAvroRecord(new HoodieKey(rowChange2.getRowKey(), emptyPartitionPath), rowChange2);
+    RawTripTestPayload rowChange3 = new RawTripTestPayload(recordStr3);
+    HoodieRecord record3 =
+        new HoodieAvroRecord(new HoodieKey(rowChange3.getRowKey(), emptyPartitionPath), rowChange3);
+
+    JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record1, record2, record3));
+
+    // Also create the metadata and config
+    HoodieWriteConfig config =
+        makeConfig(rangePruning, treeFiltering, bucketizedChecking, useMetadataTable);
+    HoodieSparkTable hoodieTable = HoodieSparkTable.create(config, context, metaClient);
+    metadataWriter = SparkHoodieBackedTableMetadataWriter.create(hadoopConf, config, context);
+    HoodieSparkWriteableTestTable testTable = HoodieSparkWriteableTestTable.of(metaClient, SCHEMA, metadataWriter);
+
+    // Let's tag
+    HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, SparkHoodieBloomIndexHelper.getInstance());
+    JavaRDD<HoodieRecord> taggedRecordRDD = tagLocation(bloomIndex, recordRDD, hoodieTable);
+
+    // Should not find any files
+    for (HoodieRecord record : taggedRecordRDD.collect()) {
+      assertFalse(record.isCurrentLocationKnown());
+    }
+
+    final Map<String, List<Pair<String, Integer>>> partitionToFilesNameLengthMap = new HashMap<>();
+
+    // We create three parquet file, each having one record
+    final String fileId1 = UUID.randomUUID().toString();

Review Comment:
   Same comment as above





[GitHub] [hudi] yihua merged pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua merged PR #6113:
URL: https://github.com/apache/hudi/pull/6113




[GitHub] [hudi] vinothchandar commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r929275152


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -409,8 +409,11 @@ public static HoodieData<HoodieRecord> convertMetadataToBloomFilterRecords(
         LOG.error("Failed to find path in write stat to update metadata table " + hoodieWriteStat);
         return Collections.emptyListIterator();
       }
-      int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) :
-          partition.length() + 1;
+
+      // For partitioned table, "partition" contains the relative partition path;
+      // for non-partitioned table, "partition" is empty
+      int offset = StringUtils.isNullOrEmpty(partition)

Review Comment:
   Fetching a file name from the full path and the partition path: should this be a helper in FSUtils? Can we move this logic there? Better yet, can't we just `String.replace` the first occurrence of the partition path within the full path and be done, without the index/offset business?
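
One way the replace-based suggestion could look, as a sketch only. The class and method names are hypothetical and this is not code from the PR. Note that `String.replaceFirst` interprets its first argument as a regular expression, so the partition path is quoted and a pattern is compiled on each call:

```java
import java.util.regex.Pattern;

class StripPartitionPrefixSketch {
    // Removes the first occurrence of the partition path from the relative path,
    // then drops a leading "/" if one remains.
    static String fileName(String partition, String pathWithPartition) {
        String prefix = partition == null ? "" : partition;
        // Pattern.quote keeps any regex metacharacters in the partition path literal.
        String rest = pathWithPartition.replaceFirst(Pattern.quote(prefix), "");
        return rest.startsWith("/") ? rest.substring(1) : rest;
    }

    public static void main(String[] args) {
        System.out.println(fileName("2022/07/15", "2022/07/15/a_1-0-1_0000001.parquet"));
        System.out.println(fileName("", "a_1-0-1_0000001.parquet"));
    }
}
```

This avoids the offset arithmetic at the cost of a regex match per path, and it removes the first occurrence wherever it appears rather than strictly a leading prefix.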





[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r933637330


##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -409,8 +409,11 @@ public static HoodieData<HoodieRecord> convertMetadataToBloomFilterRecords(
         LOG.error("Failed to find path in write stat to update metadata table " + hoodieWriteStat);
         return Collections.emptyListIterator();
       }
-      int offset = partition.equals(NON_PARTITIONED_NAME) ? (pathWithPartition.startsWith("/") ? 1 : 0) :
-          partition.length() + 1;
+
+      // For partitioned table, "partition" contains the relative partition path;
+      // for non-partitioned table, "partition" is empty
+      int offset = StringUtils.isNullOrEmpty(partition)

Review Comment:
   Addressed in #6250.  `String.replace` could be slow so I still use the current logic.  I moved it into a util method.





[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191346815

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933",
       "triggerID" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "triggerType" : "PUSH"
     }, {
       "hash" : "be1514fdda14e808df07901a26482371ec391ae8",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127",
       "triggerID" : "be1514fdda14e808df07901a26482371ec391ae8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * be1514fdda14e808df07901a26482371ec391ae8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r926267851


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -428,7 +455,99 @@ public void testTagLocation(boolean rangePruning, boolean treeFiltering, boolean
 
   @ParameterizedTest(name = TEST_NAME_WITH_PARAMS)
   @MethodSource("configParams")
-  public void testCheckExists(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) throws Exception {
+  public void testTagLocationOnNonpartitionedTable(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking,
+      boolean useMetadataTable) throws Exception {
+    // We have some records to be tagged (two different partitions)
+    String rowKey1 = UUID.randomUUID().toString();
+    String rowKey2 = UUID.randomUUID().toString();
+    String rowKey3 = UUID.randomUUID().toString();
+    String recordStr1 = "{\"_row_key\":\"" + rowKey1 + "\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":12}";
+    String recordStr2 = "{\"_row_key\":\"" + rowKey2 + "\",\"time\":\"2016-01-31T03:20:41.415Z\",\"number\":100}";
+    String recordStr3 = "{\"_row_key\":\"" + rowKey3 + "\",\"time\":\"2016-01-31T03:16:41.415Z\",\"number\":15}";
+
+    String emptyPartitionPath = "";
+    RawTripTestPayload rowChange1 = new RawTripTestPayload(recordStr1);
+    HoodieRecord record1 =
+        new HoodieAvroRecord(new HoodieKey(rowChange1.getRowKey(), emptyPartitionPath), rowChange1);
+    RawTripTestPayload rowChange2 = new RawTripTestPayload(recordStr2);
+    HoodieRecord record2 =
+        new HoodieAvroRecord(new HoodieKey(rowChange2.getRowKey(), emptyPartitionPath), rowChange2);
+    RawTripTestPayload rowChange3 = new RawTripTestPayload(recordStr3);
+    HoodieRecord record3 =
+        new HoodieAvroRecord(new HoodieKey(rowChange3.getRowKey(), emptyPartitionPath), rowChange3);
+
+    JavaRDD<HoodieRecord> recordRDD = jsc.parallelize(Arrays.asList(record1, record2, record3));
+
+    // Also create the metadata and config
+    HoodieWriteConfig config =
+        makeConfig(rangePruning, treeFiltering, bucketizedChecking, useMetadataTable);
+    HoodieSparkTable hoodieTable = HoodieSparkTable.create(config, context, metaClient);
+    metadataWriter = SparkHoodieBackedTableMetadataWriter.create(hadoopConf, config, context);
+    HoodieSparkWriteableTestTable testTable = HoodieSparkWriteableTestTable.of(metaClient, SCHEMA, metadataWriter);
+
+    // Let's tag
+    HoodieBloomIndex bloomIndex = new HoodieBloomIndex(config, SparkHoodieBloomIndexHelper.getInstance());
+    JavaRDD<HoodieRecord> taggedRecordRDD = tagLocation(bloomIndex, recordRDD, hoodieTable);
+
+    // Should not find any files
+    for (HoodieRecord record : taggedRecordRDD.collect()) {
+      assertFalse(record.isCurrentLocationKnown());
+    }
+
+    final Map<String, List<Pair<String, Integer>>> partitionToFilesNameLengthMap = new HashMap<>();
+
+    // We create three parquet file, each having one record
+    final String fileId1 = UUID.randomUUID().toString();

Review Comment:
   Addressed.





[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191707602

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "be1514fdda14e808df07901a26482371ec391ae8",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "be1514fdda14e808df07901a26482371ec391ae8",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * be1514fdda14e808df07901a26482371ec391ae8 UNKNOWN
   




[GitHub] [hudi] yihua commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191707668

   Azure CI passed before rebasing.  I'm going to merge the PR once Java CI passes.
   <img width="1115" alt="Screen Shot 2022-07-21 at 09 18 48" src="https://user-images.githubusercontent.com/2497195/180267520-139bf3eb-5a36-41d8-a8e4-5b03b81f595f.png">
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191716324

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "be1514fdda14e808df07901a26482371ec391ae8",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127",
       "triggerID" : "be1514fdda14e808df07901a26482371ec391ae8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10153",
       "triggerID" : "f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * be1514fdda14e808df07901a26482371ec391ae8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127) 
   * f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10153) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1185068448

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 445f9da3a360ea62ddd71e7669a9973e8a8bb2ff UNKNOWN
   




[GitHub] [hudi] codope commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r923152896


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -428,7 +455,99 @@ public void testTagLocation(boolean rangePruning, boolean treeFiltering, boolean
 
   @ParameterizedTest(name = TEST_NAME_WITH_PARAMS)
   @MethodSource("configParams")
-  public void testCheckExists(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) throws Exception {
+  public void testTagLocationOnNonpartitionedTable(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking,
+      boolean useMetadataTable) throws Exception {
+    // We have some records to be tagged (two different partitions)
+    String rowKey1 = UUID.randomUUID().toString();

Review Comment:
   +1
   Would be good to do this for all tests.
   Just take the change in `FileSystemTestUtils` https://github.com/apache/hudi/pull/6049/files#diff-39cc6a706b7cad836c15caf1a61256c5f50d4ac8ea511de5a4a29f8deb3a2f6c
   ```
   public static final Random RANDOM = new Random(0xDEED);
   ```





[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r926262573


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -110,24 +116,35 @@ public void tearDown() throws Exception {
     cleanupResources();
   }
 
-  private HoodieWriteConfig makeConfig(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) {
+  private HoodieWriteConfig makeConfig(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking, boolean useMetadataTable) {
+    // For the bloom index to use column stats and bloom filters from metadata table,
+    // the following configs must be set to true:
+    // "hoodie.bloom.index.use.metadata"
+    // "hoodie.metadata.enable" (by default is true)
+    // "hoodie.metadata.index.column.stats.enable"
+    // "hoodie.metadata.index.bloom.filter.enable"
     return HoodieWriteConfig.newBuilder().withPath(basePath)
         .withIndexConfig(HoodieIndexConfig.newBuilder().bloomIndexPruneByRanges(rangePruning)
             .bloomIndexTreebasedFilter(treeFiltering).bloomIndexBucketizedChecking(bucketizedChecking)
-            .bloomIndexKeysPerBucket(2).build())
+            .bloomIndexKeysPerBucket(2).bloomIndexUseMetadata(useMetadataTable).build())

Review Comment:
   Addressed.





[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191711992

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "be1514fdda14e808df07901a26482371ec391ae8",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127",
       "triggerID" : "be1514fdda14e808df07901a26482371ec391ae8",
       "triggerType" : "PUSH"
     }, {
       "hash" : "f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * be1514fdda14e808df07901a26482371ec391ae8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127) 
   * f82f2a6b0757e62e29e5e5bcaef6a040920aaa7a UNKNOWN
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1185070603

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933",
       "triggerID" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 445f9da3a360ea62ddd71e7669a9973e8a8bb2ff Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1185090301

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933",
       "triggerID" : "445f9da3a360ea62ddd71e7669a9973e8a8bb2ff",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 445f9da3a360ea62ddd71e7669a9973e8a8bb2ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191103819

   ## CI report:
   
   * 445f9da3a360ea62ddd71e7669a9973e8a8bb2ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933) 
   * be1514fdda14e808df07901a26482371ec391ae8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10127) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6113:
URL: https://github.com/apache/hudi/pull/6113#issuecomment-1191100089

   ## CI report:
   
   * 445f9da3a360ea62ddd71e7669a9973e8a8bb2ff Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=9933) 
   * be1514fdda14e808df07901a26482371ec391ae8 UNKNOWN
   




[GitHub] [hudi] yihua commented on a diff in pull request #6113: [HUDI-4400] Fix missing bloom filters in metadata table in non-partitioned table

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6113:
URL: https://github.com/apache/hudi/pull/6113#discussion_r926267751


##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieBloomIndex.java:
##########
@@ -428,7 +455,99 @@ public void testTagLocation(boolean rangePruning, boolean treeFiltering, boolean
 
   @ParameterizedTest(name = TEST_NAME_WITH_PARAMS)
   @MethodSource("configParams")
-  public void testCheckExists(boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking) throws Exception {
+  public void testTagLocationOnNonpartitionedTable(
+      boolean rangePruning, boolean treeFiltering, boolean bucketizedChecking,
+      boolean useMetadataTable) throws Exception {
+    // We have some records to be tagged (two different partitions)
+    String rowKey1 = UUID.randomUUID().toString();

Review Comment:
   Sounds good.  I made `genPseudoRandomUUID()` public and now use it, backed by a pseudo-random generator, to generate all UUIDs in this test class.
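
For readers following along, here is a minimal sketch of why a seeded generator helps here: the same seed always yields the same UUIDs, so a failing test run can be reproduced exactly. The class name and the helper's signature below are assumptions made for illustration; the actual `genPseudoRandomUUID()` in the Hudi test code may take different arguments and build the UUID differently.

```java
import java.util.Random;
import java.util.UUID;

// Illustrative sketch only; not the Hudi implementation.
final class PseudoRandomUuidSketch {

  // Hypothetical signature: derives a UUID from 16 bytes drawn from the given Random.
  // UUID.nameUUIDFromBytes produces a name-based (version 3) UUID from those bytes.
  static UUID genPseudoRandomUUID(Random random) {
    byte[] bytes = new byte[16];
    random.nextBytes(bytes);
    return UUID.nameUUIDFromBytes(bytes);
  }

  public static void main(String[] args) {
    // Two generators with the same seed produce the same UUID sequence.
    Random first = new Random(42L);
    Random second = new Random(42L);

    String rowKey1 = genPseudoRandomUUID(first).toString();
    String rowKey2 = genPseudoRandomUUID(first).toString();

    String replayedKey1 = genPseudoRandomUUID(second).toString();
    String replayedKey2 = genPseudoRandomUUID(second).toString();

    System.out.println(rowKey1.equals(replayedKey1)); // same seed, same first key
    System.out.println(rowKey2.equals(replayedKey2)); // same seed, same second key
  }
}
```

Unlike `UUID.randomUUID()`, which draws from a secure random source and changes on every run, a helper along these lines keeps record keys stable across runs as long as the seed is fixed.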



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org