Posted to commits@hudi.apache.org by "Yue Zhang (Jira)" <ji...@apache.org> on 2021/11/30 08:08:00 UTC
[jira] [Created] (HUDI-2892) Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
Yue Zhang created HUDI-2892:
-------------------------------
Summary: Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
Key: HUDI-2892
URL: https://issues.apache.org/jira/browse/HUDI-2892
Project: Apache Hudi
Issue Type: Bug
Reporter: Yue Zhang
Step 1
Do a normal Hudi insert.
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties
Step 2
Build a clustering plan, but do not execute it.
20211130114103632.replacecommit.requested will cluster the data files written by 20211130113918979.commit.
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties
Step 3
Run the Hudi insert a few more times and trigger several archivals.
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 11:39 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:44 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:39 20211130113918979.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:39 20211130113918979.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 11:41 20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:41 20211130114122881.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:41 20211130114122881.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:42 20211130114207164.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:42 20211130114207164.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 11:44 20211130114351703.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 11:43 20211130114351703.inflight
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 11:39 archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 11:39 hoodie.properties
After archival, the active timeline becomes:
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:23 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties
20211130114122881.commit, 20211130114207164.commit, and 20211130114351703.commit were archived (as the listings show, 20211130113918979.commit was archived as well), while the pending 20211130114103632.replacecommit.requested remains on the active timeline.
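The archival behavior observed in Step 3 can be sketched as follows. This is an illustrative model, not Hudi internals: completed instants beyond the retained window are moved to the archived timeline, while the pending replacecommit, having never completed, is left behind with its old timestamp. The `Instant` model, `archive` helper, and `retain` parameter are all made up for illustration.

```scala
// Illustrative model of Step 3's archival, NOT actual Hudi code.
case class Instant(ts: String, action: String, completed: Boolean)

// Keep at most `retain` newest completed instants; pending instants are never archived.
def archive(timeline: Seq[Instant], retain: Int): (Seq[Instant], Seq[Instant]) = {
  val (done, pending) = timeline.partition(_.completed)
  val archived = done.sortBy(_.ts).dropRight(retain)
  (archived, (done.diff(archived) ++ pending).sortBy(_.ts))
}

val tl = Seq(
  Instant("20211130113918979", "commit", completed = true),
  Instant("20211130114103632", "replacecommit", completed = false),
  Instant("20211130114122881", "commit", completed = true),
  Instant("20211130114207164", "commit", completed = true),
  Instant("20211130114351703", "commit", completed = true),
  Instant("20211130131825336", "commit", completed = true),
  Instant("20211130132256488", "commit", completed = true),
  Instant("20211130132327154", "commit", completed = true)
)
val (archived, active) = archive(tl, retain = 3)
// The four oldest commits are archived, matching the listings above, and the
// stale requested replacecommit becomes the earliest instant still active.
assert(archived.map(_.ts) == Seq("20211130113918979", "20211130114122881",
  "20211130114207164", "20211130114351703"))
assert(active.head.ts == "20211130114103632")
```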
Step 4
Run queries to check the record count and the underlying Hudi base files.
val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|4217794 |
+--------+
val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
+----------------------------------------------------------------------+
Step 5
Stop inserting and execute the pending clustering replacecommit.
drwxr-xr-x 3 yuezhang FREEWHEELMEDIA\Domain Users 96 11 30 13:17 .aux/
drwxr-xr-x 2 yuezhang FREEWHEELMEDIA\Domain Users 64 11 30 13:27 .temp/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 4736 11 30 13:27 20211130114103632.replacecommit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:27 20211130114103632.replacecommit.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 2976 11 30 13:17 20211130114103632.replacecommit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:18 20211130131825336.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:18 20211130131825336.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132256488.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:22 20211130132256488.inflight
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 5485 11 30 13:23 20211130132327154.commit
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.commit.requested
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 0 11 30 13:23 20211130132327154.inflight
drwxr-xr-x 6 yuezhang FREEWHEELMEDIA\Domain Users 192 11 30 13:23 archived/
-rw-r--r-- 1 yuezhang FREEWHEELMEDIA\Domain Users 553 11 30 13:17 hoodie.properties
Step 6
Run the same queries to check the record count and the underlying Hudi base files.
val frame = spark.sql("select count(*) from hudi_test").show(10000, false)
=>
+--------+
|count(1)|
+--------+
|2410168 |
+--------+
val frame = spark.sql("select distinct(_hoodie_file_name) from hudi_test").show(10000, false)
=>
+----------------------------------------------------------------------+
|_hoodie_file_name |
+----------------------------------------------------------------------+
|caef07aa-087a-42ed-b61f-a0999fc588e8-0_1-8-0_20211130132327154.parquet|
|ac474457-c656-4fff-ac07-7ddd1746f4cf-0_1-8-0_20211130113918979.parquet|
|73babec7-10f6-4b76-84d8-b80d629c222a-0_0-7-0_20211130131825336.parquet|
|a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0_0-7-0_20211130113918979.parquet|
|7978966e-0874-4809-b9ca-4a88d73ab373-0_1-8-0_20211130131825336.parquet|
|00295c50-6551-49a7-8ac4-da4d0bd33048-0_0-7-0_20211130132327154.parquet|
|a2aa3997-809b-479d-839e-9291b7b6e9d4-0_0-7-0_20211130132256488.parquet|
|d9a5947a-a8d7-44d7-9d74-dbc174d7a326-0_1-8-0_20211130132256488.parquet|
+----------------------------------------------------------------------+
As we can see, the query results before and after clustering differ: 4,217,794 records before vs. 2,410,168 after, a loss of 1,807,626 records, which is exactly 3 x 602,542, i.e. the contents of the three archived commits. The Step 6 result is missing the records from the base files listed below.
|12f0e65c-9cd8-470f-b4f1-ec4815d9af0a-0_1-8-0_20211130114122881.parquet|
|9e610a31-1b85-41f0-b304-70ca154a5011-0_0-7-0_20211130114122881.parquet|
|823e0eef-e24a-400c-878d-4c26d4db5994-0_0-7-0_20211130114207164.parquet|
|eb149360-a1ba-4236-93a0-85425e86b70c-0_1-8-0_20211130114207164.parquet|
|b06b3beb-5bd7-4756-b961-37c558e35625-0_0-7-0_20211130114351703.parquet|
|4b29a4bc-cb2b-4024-85a6-e07601d86334-0_1-8-0_20211130114351703.parquet|
The root cause of the incomplete query results is that the late-finished clustering instant stains the active timeline: once the replacecommit completes, its old timestamp (20211130114103632) becomes the earliest completed instant, so the already-archived commits with later timestamps (20211130114122881, 20211130114207164, 20211130114351703) are neither contained in the active timeline nor before its start, and Hudi picks the wrong latest base file here:
https://github.com/apache/hudi/blob/55ecbc662e30068ce0ed49166d254202bd598a8c/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileGroup.java#L120
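The staining effect can be illustrated with a tiny model of the committed-file check (this is a sketch of the idea behind the linked code, not Hudi's actual implementation; the `Timeline` class and method names are made up): a base file is treated as committed if its instant is contained in the active timeline, or lies before the timeline's first completed instant (meaning it was archived).

```scala
// Illustration (not Hudi internals) of how the late-finished replacecommit
// moves the timeline start backwards and hides archived commits' files.
case class Timeline(completed: Set[String]) {
  def firstInstant: Option[String] = completed.toSeq.sorted.headOption
  def containsOrBeforeTimelineStarts(ts: String): Boolean =
    completed.contains(ts) || firstInstant.exists(ts < _)
}

// Active timeline at Step 4: clustering still pending, three live commits.
val beforeClustering =
  Timeline(Set("20211130131825336", "20211130132256488", "20211130132327154"))
// Archived commit 20211130114122881 is before the timeline start -> visible.
assert(beforeClustering.containsOrBeforeTimelineStarts("20211130114122881"))

// Step 5 completes the replacecommit under its OLD timestamp, moving the
// timeline start backwards to 20211130114103632.
val afterClustering = Timeline(beforeClustering.completed + "20211130114103632")
// The same archived commit is now neither contained nor before the start,
// so its base files are silently dropped from query results.
assert(!afterClustering.containsOrBeforeTimelineStarts("20211130114122881"))
```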
To fix this bug, we need to let pending clustering instants block the archive action, just as pending compactions already do.
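A minimal sketch of that fix, with illustrative names rather than actual Hudi code: before archiving, cap the archivable range at the earliest pending clustering (replacecommit) instant, mirroring the existing guard for pending compactions.

```scala
// Sketch of the proposed fix (illustrative model, not actual Hudi code).
case class Instant(ts: String, action: String, completed: Boolean)

def earliestPendingClustering(timeline: Seq[Instant]): Option[String] =
  timeline.collect { case i if i.action == "replacecommit" && !i.completed => i.ts }
    .sorted.headOption

// Only completed instants strictly before that bound may be archived.
def archivable(timeline: Seq[Instant]): Seq[Instant] = {
  val bound = earliestPendingClustering(timeline)
  timeline.filter(i => i.completed && bound.forall(i.ts < _))
}

val tl = Seq(
  Instant("20211130113918979", "commit", completed = true),
  Instant("20211130114103632", "replacecommit", completed = false),
  Instant("20211130114122881", "commit", completed = true),
  Instant("20211130114351703", "commit", completed = true)
)
// With the guard, archival cannot move past the pending clustering, so
// 20211130114122881 and later commits stay on the active timeline.
assert(archivable(tl).map(_.ts) == Seq("20211130113918979"))
```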
P.S.
Each ingestion inserts 602,542 records.
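The reported counts are internally consistent with this figure (assuming 7 ingestions in total, of which the 3 archived commits' records go missing after clustering):

```scala
// Consistency check of the record counts reported above.
val perIngestion = 602542L
assert(7 * perIngestion == 4217794L)            // Step 4: count before clustering
assert(4 * perIngestion == 2410168L)            // Step 6: count after clustering
assert(4217794L - 2410168L == 3 * perIngestion) // exactly the three archived commits
```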
Contents of 20211130114103632.replacecommit:
{
"partitionToWriteStats" : {
"20210623" : [ {
"fileId" : "9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0",
"path" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet",
"prevCommit" : "null",
"numWrites" : 602542,
"numDeletes" : 0,
"numUpdateWrites" : 0,
"numInserts" : 602542,
"totalWriteBytes" : 17645296,
"totalWriteErrors" : 0,
"tempPath" : null,
"partitionPath" : "20210623",
"totalLogRecords" : 0,
"totalLogFilesCompacted" : 0,
"totalLogSizeCompacted" : 0,
"totalUpdatedRecordsCompacted" : 0,
"totalLogBlocks" : 0,
"totalCorruptLogBlock" : 0,
"totalRollbackBlocks" : 0,
"fileSizeInBytes" : 17645296,
"minEventTime" : null,
"maxEventTime" : null
} ]
},
"compacted" : false,
"extraMetadata" : {
"schema" : "xxxxx"
},
"operationType" : "CLUSTER",
"partitionToReplaceFileIds" : {
"20210623" : [ "ac474457-c656-4fff-ac07-7ddd1746f4cf-0", "a99ffa3b-34e7-4ccf-bedc-a169c717c1d8-0" ]
},
"fileIdAndRelativePaths" : {
"9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0" : "20210623/9656f4c5-76f2-49d3-ae50-600bdcbc43b3-0_0-1-2_20211130114103632.parquet"
},
"totalRecordsDeleted" : 0,
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 11053,
"totalUpsertTime" : 0,
"minAndMaxEventTime" : {
"Optional.empty" : {
"val" : null,
"present" : false
}
},
"writePartitionPaths" : [ "20210623" ]
}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)