Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/06 22:23:15 UTC

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5773: [HUDI-4200] Fixing sorting of keys fetched from metadata table

alexeykudinkin commented on code in PR #5773:
URL: https://github.com/apache/hudi/pull/5773#discussion_r890611911


##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/io/storage/TestHoodieHFileReaderWriter.java:
##########
@@ -316,15 +317,20 @@ public void testReaderGetRecordIteratorByKeyPrefixes() throws Exception {
     assertEquals(expectedKey50and0s, recordsByPrefix);
 
     // filter for "key1" and "key0" : entries from 'key10 to key19' and 'key00 to key09' should be matched.
-    List<GenericRecord> expectedKey1sand0s = expectedKey1s;
-    expectedKey1sand0s.addAll(allRecords.stream()
-        .filter(entry -> (entry.get("_row_key").toString()).contains("key0"))
-        .collect(Collectors.toList()));
+    List<GenericRecord> expectedKey1sand0s = allRecords.stream()
+        .filter(entry -> (entry.get("_row_key").toString()).contains("key1") || (entry.get("_row_key").toString()).contains("key0"))
+        .collect(Collectors.toList());
     iterator =
         hfileReader.getRecordsByKeyPrefixIterator(Arrays.asList("key1", "key0"), avroSchema);
     recordsByPrefix =
         StreamSupport.stream(Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED), false)
             .collect(Collectors.toList());
+    Collections.sort(recordsByPrefix, new Comparator<GenericRecord>() {

Review Comment:
   Why is this needed?



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -192,8 +197,12 @@ public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(L
   }
 
   @Override
-  public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys,
+  public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keysUnsorted,
                                                                                           String partitionName) {
+    // Sort the columns so that keys are looked up in order
+    List<String> keys = new ArrayList<>();

Review Comment:
   Same as above



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -142,8 +142,13 @@ protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKey(String key,
   }
 
   @Override
-  public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixes,
+  public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixesUnsorted,
                                                                                  String partitionName) {
+    // Sort the columns so that keys are looked up in order
+    List<String> keyPrefixes = new ArrayList<>();

Review Comment:
   In general it's better to use the `new ArrayList<>(col)` copy constructor, as this avoids the subsequent re-allocations of the backing array.
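   A minimal sketch of the suggestion (not code from the PR; the variable names are illustrative): the copy constructor sizes the backing array to the source collection up front, and sorting the copy leaves the caller's list untouched.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;
   
   public class SortedCopyExample {
       public static void main(String[] args) {
           List<String> keyPrefixesUnsorted = Arrays.asList("key9", "key1", "key0");
   
           // Copy constructor allocates the backing array at the right size
           // immediately, instead of growing it after construction.
           List<String> keyPrefixes = new ArrayList<>(keyPrefixesUnsorted);
           Collections.sort(keyPrefixes);
   
           System.out.println(keyPrefixes); // [key0, key1, key9]
       }
   }
   ```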



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -142,8 +142,13 @@ protected Option<HoodieRecord<HoodieMetadataPayload>> getRecordByKey(String key,
   }
 
   @Override
-  public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixes,
+  public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(List<String> keyPrefixesUnsorted,

Review Comment:
   The name is misleading -- the prefixes might already be sorted. I don't think we need to change the name; we just need to make sure we're sorting, and that's it.
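   A sketch of the point being made, with a `TreeMap` standing in for the sorted file (names hypothetical, not HFile API): the caller's order doesn't matter as long as we sort a copy before probing, so each seek moves the conceptual scanner forward, never backward.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;
   import java.util.NavigableMap;
   import java.util.TreeMap;
   
   public class OrderedLookupSketch {
       public static void main(String[] args) {
           // TreeMap stands in for an HFile: entries are stored sorted by key,
           // and forward-only scans are cheap.
           NavigableMap<String, String> file = new TreeMap<>();
           file.put("key00", "a");
           file.put("key10", "b");
           file.put("key20", "c");
   
           // Keys may arrive in any order; sort a copy so lookups proceed
           // in storage order without renaming the incoming parameter.
           List<String> requested = new ArrayList<>(Arrays.asList("key20", "key00"));
           Collections.sort(requested);
   
           List<String> hits = new ArrayList<>();
           for (String k : requested) {
               String v = file.get(k);
               if (v != null) {
                   hits.add(k + "=" + v);
               }
           }
           System.out.println(hits); // [key00=a, key20=c]
       }
   }
   ```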



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala:
##########
@@ -250,7 +250,7 @@ class TestColumnStatsIndex extends HoodieClientTestBase with ColumnStatsIndexSup
 
     {
       // We have to include "c1", since we sort the expected outputs by this column
-      val requestedColumns = Seq("c1", "c4")
+      val requestedColumns = Seq("c2", "c1", "c4")

Review Comment:
   nit: Flipping the order would have avoided the need to change the fixture.



##########
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java:
##########
@@ -259,11 +259,8 @@ private static Iterator<GenericRecord> getRecordByKeyPrefixIteratorInternal(HFil
         return Collections.emptyIterator();
       }
     } else if (val == -1) {
-      // If scanner is aleady on the top of hfile. avoid trigger seekTo again.
-      Option<Cell> headerCell = Option.fromJavaOptional(scanner.getReader().getFirstKey());
-      if (headerCell.isPresent() && !headerCell.get().equals(scanner.getCell())) {
-        scanner.seekTo();
-      }
+      // seek to beginning. anyways, its key prefix search.

Review Comment:
   Let's elaborate the comment to make sure someone reading it without context is able to understand it:
   Whenever `val == -1`, the HFile reader will place the pointer right before the first record. We have to advance it to the first record of the file to validate whether it matches our search criteria.
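   A sketch of the semantics being described, using a sorted list and `Collections.binarySearch` in place of the HFile scanner (an analogy, not the HBase API): a key that sorts before every stored key yields an insertion point of 0, the analogue of `seekTo` returning `-1`, and scanning must begin at the first record rather than skip it.
   
   ```java
   import java.util.Arrays;
   import java.util.Collections;
   import java.util.List;
   
   public class SeekSemanticsSketch {
       public static void main(String[] args) {
           List<String> sortedKeys = Arrays.asList("key10", "key15", "key20");
   
           // Searching for a prefix that sorts before every stored key leaves
           // the "pointer" before the first record, like seekTo returning -1.
           int pos = Collections.binarySearch(sortedKeys, "key0");
           int start = pos >= 0 ? pos : -pos - 1; // insertion point
   
           // start == 0: begin at the first record and test it against the
           // search criteria instead of skipping past it.
           System.out.println(start + " -> " + sortedKeys.get(start)); // 0 -> key10
       }
   }
   ```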



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java:
##########
@@ -192,8 +197,12 @@ public HoodieData<HoodieRecord<HoodieMetadataPayload>> getRecordsByKeyPrefixes(L
   }
 
   @Override
-  public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys,
+  public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keysUnsorted,

Review Comment:
   Same as above


