Posted to commits@hudi.apache.org by "Yue Zhang (Jira)" <ji...@apache.org> on 2021/11/30 08:08:00 UTC

[jira] [Created] (HUDI-2892) Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

Yue Zhang created HUDI-2892:
-------------------------------

             Summary: Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
                 Key: HUDI-2892
                 URL: https://issues.apache.org/jira/browse/HUDI-2892
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Yue Zhang


 

Step 1 
Perform a normal Hudi insert.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties

Step 2 
Build a clustering plan, but do not execute it.
20211130114103632.replacecommit.requested will cluster the data files written by 20211130113918979.commit.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 20211130114103632.replacecommit.requested
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties

Step 3 
Run several more Hudi inserts so that archival is triggered several times.

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 11:39 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:44 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:39 20211130113918979.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:39 20211130113918979.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 11:41 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:41 20211130114122881.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 20211130114122881.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:41 20211130114122881.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:42 20211130114207164.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 20211130114207164.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:42 20211130114207164.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 11:44 20211130114351703.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 20211130114351703.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 11:43 20211130114351703.inflight
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 11:39 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 11:39 hoodie.properties

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:23 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 20211130131825336.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132256488.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132327154.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 hoodie.properties


20211130114122881.commit, 20211130114207164.commit, and 20211130114351703.commit were archived.


Step 4 
Run queries to check the record count and the Hudi base files.
spark.sql("select count(*) from hudi_test").show(10000, false)
=> 
+--------+
|count(1)|
+--------+
|4217794 |
+--------+

spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name                                                     |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
+----------------------------------------------------------------------+


Step 5 
Stop the inserts and execute the pending clustering plan (20211130114103632.replacecommit.requested).

drwxr-xr-x   3 yuezhang  FREEWHEELMEDIA\Domain Users    96 11 30 13:17 .aux/
drwxr-xr-x   2 yuezhang  FREEWHEELMEDIA\Domain Users    64 11 30 13:27 .temp/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  4736 11 30 13:27 20211130114103632.replacecommit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:27 20211130114103632.replacecommit.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:18 20211130131825336.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:18 20211130131825336.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132256488.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:22 20211130132256488.inflight
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users  5485 11 30 13:23 20211130132327154.commit
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users     0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x   6 yuezhang  FREEWHEELMEDIA\Domain Users   192 11 30 13:23 archived/
-rw-r--r--   1 yuezhang  FREEWHEELMEDIA\Domain Users   553 11 30 13:17 hoodie.properties


Step 6 
Run the same queries again to check the record count and the Hudi base files.

spark.sql("select count(*) from hudi_test").show(10000, false)
=> 
+--------+
|count(1)|
+--------+
|2410168 |
+--------+

spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name                                                     |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
+----------------------------------------------------------------------+


As we can see, the query results before and after clustering differ.
The result from Step 6 is missing the records from the base files listed below.

|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|

|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|

|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
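
The list of missing files can be derived mechanically by diffing the two _hoodie_file_name result sets and grouping the difference by the instant time at the end of each file name. The following is only a minimal sketch: the object MissingBaseFiles and its helpers are made-up names for illustration and are not part of Hudi, and it assumes base file names of the form <fileId>_<writeToken>_<instantTime>.parquet, as seen in the query output.

```scala
object MissingBaseFiles {
  // Assumes names of the form <fileId>_<writeToken>_<instantTime>.parquet.
  def instantOf(fileName: String): String =
    fileName.stripSuffix(".parquet").split("_").last

  // Files visible before clustering but not after, grouped by their instant time.
  def missingByInstant(before: Seq[String], after: Seq[String]): Map[String, Seq[String]] =
    before.filterNot(after.toSet).groupBy(instantOf)

  // Result of the Step 4 query (before the clustering was executed).
  val step4: Seq[String] = Seq(
    "caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet",
    "12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet",
    "ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet",
    "73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet",
    "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet",
    "7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet",
    "823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet",
    "00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet",
    "a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet",
    "eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet",
    "b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet",
    "d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet",
    "9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet",
    "4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet"
  )

  // Result of the Step 6 query (after the clustering was executed).
  val step6: Seq[String] = Seq(
    "caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet",
    "ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet",
    "73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet",
    "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet",
    "7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet",
    "00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet",
    "a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet",
    "d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet"
  )

  def main(args: Array[String]): Unit =
    missingByInstant(step4, step6).toSeq.sortBy(_._1).foreach { case (instant, files) =>
      println(s"$instant -> ${files.mkString(", ")}")
    }
}
```

The keys of the resulting map are exactly the three instants whose commits were archived in Step 3.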

 

The root cause of the incomplete query results is that the late-finishing clustering instant stains the active timeline, so Hudi picks the wrong latest base file in the logic at
https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120
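
To make the mechanism concrete, here is a simplified model in Scala. It is a sketch under an assumption, not Hudi's actual code: it assumes the check at the linked line treats a file slice as committed when its instant time is either present in the active timeline of completed instants or older than the first instant of that timeline. The names TimelineVisibility, Timeline, and treatsAsCommitted are hypothetical.

```scala
object TimelineVisibility {
  // Simplified stand-in for the active timeline of completed instants.
  // Instant times are fixed-width digit strings, so string order is time order.
  final case class Timeline(completed: Seq[String]) {
    private val sorted = completed.sorted

    def contains(ts: String): Boolean = sorted.contains(ts)

    // True when ts is older than the first instant still in the active timeline.
    def isBeforeStart(ts: String): Boolean = sorted.headOption.exists(ts < _)

    // Assumed rule: committed if in the timeline, or older than its start.
    def treatsAsCommitted(ts: String): Boolean = contains(ts) || isBeforeStart(ts)
  }

  // Active timeline after archival, before the pending clustering is executed.
  val beforeClustering: Timeline =
    Timeline(Seq("20211130131825336", "20211130132256488", "20211130132327154"))

  // The late clustering completes with its old instant time and becomes the
  // first instant of the active timeline.
  val afterClustering: Timeline =
    Timeline("20211130114103632" +: beforeClustering.completed)

  def main(args: Array[String]): Unit = {
    val archivedCommit = "20211130114122881"
    println(s"before clustering: ${beforeClustering.treatsAsCommitted(archivedCommit)}")
    println(s"after clustering:  ${afterClustering.treatsAsCommitted(archivedCommit)}")
  }
}
```

Under this model the archived commit 20211130114122881 is older than the timeline start before clustering, so its files stay visible. Once 20211130114103632 completes, the same commit is neither in the timeline nor older than its start, which matches the files that disappear in Step 6.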

To fix this bug, pending clustering instants need to block archival, the same way pending compaction instants already do.
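
The proposed guard can be sketched as follows. This is an illustration of the idea only, not the actual patch: ArchivalGuard, archivable, and the keep parameter (the number of newest completed instants that always stay in the active timeline) are hypothetical names.

```scala
object ArchivalGuard {
  // Completed instants that may be archived: everything except the newest
  // `keep` instants, and nothing at or after the earliest pending clustering
  // instant. Instant times are fixed-width strings, so string order is time order.
  def archivable(completed: Seq[String], pendingClustering: Seq[String], keep: Int): Seq[String] = {
    val candidates = completed.sorted.dropRight(keep)
    pendingClustering.sorted.headOption match {
      case Some(earliestPending) => candidates.takeWhile(_ < earliestPending)
      case None                  => candidates
    }
  }

  // Completed commits from the reproduction steps.
  val completed: Seq[String] = Seq(
    "20211130113918979", "20211130114122881", "20211130114207164",
    "20211130114351703", "20211130131825336", "20211130132256488",
    "20211130132327154"
  )

  def main(args: Array[String]): Unit = {
    println(archivable(completed, Seq("20211130114103632"), keep = 3))
    println(archivable(completed, Seq.empty, keep = 3))
  }
}
```

With the pending 20211130114103632.replacecommit.requested in place, the three commits written after it stay in the active timeline until the clustering completes, so they can no longer fall into the gap between the timeline start and the instants it contains.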

P.S.
Each ingestion inserts 602,542 records.
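
This per-ingestion figure lets the two counts be cross-checked against the number of ingestions. A small Scala sketch of the arithmetic follows; RecordCountCheck is just an illustrative name, and the numbers are the ones reported in Steps 4 and 6.

```scala
object RecordCountCheck {
  val recordsPerIngestion: Long = 602542L
  val countBeforeClustering: Long = 4217794L // Step 4
  val countAfterClustering: Long = 2410168L  // Step 6

  // Number of whole ingestions each count corresponds to.
  val ingestionsVisibleBefore: Long = countBeforeClustering / recordsPerIngestion
  val ingestionsVisibleAfter: Long = countAfterClustering / recordsPerIngestion

  // Records that disappeared once the pending clustering was executed.
  val missingRecords: Long = countBeforeClustering - countAfterClustering
  val missingIngestions: Long = missingRecords / recordsPerIngestion

  def main(args: Array[String]): Unit = {
    println(s"visible before: $ingestionsVisibleBefore ingestions")
    println(s"visible after:  $ingestionsVisibleAfter ingestions")
    println(s"missing:        $missingRecords records = $missingIngestions ingestions")
  }
}
```

Both counts are exact multiples of 602,542, and the difference corresponds to three ingestions, the same number as the archived commits whose base files are missing.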

Contents of 20211130114103632.replacecommit:
{
  "partitionToWriteStats" : {
    "20210623" : [ {
      "fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
      "path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
      "prevCommit" : "null",
      "numWrites" : 602542,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 602542,
      "totalWriteBytes" : 17645296,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "20210623",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 17645296,
      "minEventTime" : null,
      "maxEventTime" : null
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "xxxxx"
  },
  "operationType" : "CLUSTER",
  "partitionToReplaceFileIds" : {
    "20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
  },
  "fileIdAndRelativePaths" : {
    "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
  },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 11053,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ "20210623" ]
}

 

 

 

 


