Posted to issues@kudu.apache.org by "shenxingwuying (Jira)" <ji...@apache.org> on 2023/02/07 07:19:00 UTC

[jira] [Updated] (KUDU-3446) I think we should talk about CommitMsg's order in WAL

     [ https://issues.apache.org/jira/browse/KUDU-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

shenxingwuying updated KUDU-3446:
---------------------------------
    Summary: I think we should talk about CommitMsg's order in WAL  (was: I think we should talk about CommitMsg's order)

> I think we should talk about CommitMsg's order in WAL
> -----------------------------------------------------
>
>                 Key: KUDU-3446
>                 URL: https://issues.apache.org/jira/browse/KUDU-3446
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: shenxingwuying
>            Assignee: shenxingwuying
>            Priority: Major
>
> h1. Background
> In Kudu, WAL records come in two types: 'replicate' and 'commit'. The 'replicate' records are the Raft log entries; the 'commit' records durably record which opids have been applied to the Kudu storage engine.
> Currently, ops are applied via 'apply_pool_->Submit()' (i.e. on a concurrent thread pool),
> and the apply task mainly runs the following statements:
>  
> {code:java}
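> // op_driver.cc (abridged)
> // Each apply is submitted to the shared pool, so ApplyTask calls may run concurrently.
> apply_pool_->Submit([this]() { this->ApplyTask(); });
> void OpDriver::ApplyTask() {
>     CommitMsg* commit_msg;
>     // Apply the op to the storage engine, then append its COMMIT record to the WAL.
>     Status s = op_->Apply(&commit_msg);
>     log_->AsyncAppendCommit(*commit_msg, ...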
> } {code}
> Because apply_pool_ is a concurrent thread pool, ApplyTask invocations run concurrently. So even though the Raft log entries satisfy a happens-before relationship, the order in which they are applied to the Kudu storage engine may not satisfy it.
> For example, with 4 log records for 2 ops, we would expect:
> replicate 1.1
> commit 1.1
> replicate 1.2
> commit 1.2
> or
> replicate 1.1
> replicate 1.2
> commit 1.1
> commit 1.2
> An incorrect order (IMO) is:
> replicate 1.1
> replicate 1.2
> commit 1.2
> commit 1.1
> Currently, this order is valid in Kudu: the system allows it, and some test cases and the bootstrap processing reflect this.
> But it means that, with very high probability, 1.2 can take effect before 1.1 in the Kudu storage engine, which may not be expected.
>  
>  
> It's simple to reproduce this scenario if there are enough WriteRequests. I will write a test for this.
> I captured a case like this:
> ./bin/kudu wal dump $wal_file | egrep "REPLICATE|COMMIT" | less
> 1.75939@6812005919066001408 REPLICATE WRITE_OP
> 1.75940@6812005919066857472 REPLICATE WRITE_OP
> 1.75941@6812005919067430912 REPLICATE WRITE_OP
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> 1.75942@6812005919193690112 REPLICATE WRITE_OP
> COMMIT 1.75942
> 1.75943@6812005919311241216 REPLICATE WRITE_OP
> 1.75944@6812005919312207872 REPLICATE WRITE_OP
> 1.75945@6812005919312932864 REPLICATE WRITE_OP
> 1.75946@6812005919313645568 REPLICATE WRITE_OP
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> COMMIT 1.75946
> 1.75947@6812005919354585088 REPLICATE WRITE_OP
> COMMIT 1.75947
> 1.75948@6812005919430410240 REPLICATE WRITE_OP
> 1.75949@6812005919431192576 REPLICATE WRITE_OP
> 1.75950@6812005919431778304 REPLICATE WRITE_OP
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
> We can see COMMIT records written out of order:
> COMMIT 1.75939
> COMMIT 1.75941
> COMMIT 1.75940
> and
> COMMIT 1.75943
> COMMIT 1.75945
> COMMIT 1.75944
> and
> COMMIT 1.75948
> COMMIT 1.75950
> COMMIT 1.75949
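
The reordering visible in the dump above can also be detected mechanically. The following is a minimal sketch, assuming each commit line has exactly the `COMMIT <term>.<index>` shape shown in the sample output; the helper names `ParseCommitIndex` and `FindCommitInversions` are hypothetical and are not part of Kudu or of the `kudu wal dump` tool. It compares indexes only and ignores the term.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: extracts the index from a line such as "COMMIT 1.75939".
// Returns false for any line that is not a COMMIT record in that exact shape.
bool ParseCommitIndex(const std::string& line, int64_t* index) {
  assert(index != nullptr);
  const std::string prefix = "COMMIT ";
  if (line.compare(0, prefix.size(), prefix) != 0) return false;
  const std::size_t dot = line.find('.', prefix.size());
  if (dot == std::string::npos || dot + 1 >= line.size()) return false;
  try {
    *index = std::stoll(line.substr(dot + 1));
  } catch (const std::exception&) {
    return false;  // the text after the dot was not a number
  }
  return true;
}

// Returns every (previous, current) pair of consecutive COMMIT indexes
// in which the current index is smaller than the previous one.
std::vector<std::pair<int64_t, int64_t>> FindCommitInversions(
    const std::vector<std::string>& lines) {
  std::vector<std::pair<int64_t, int64_t>> inversions;
  bool have_prev = false;
  int64_t prev = 0;
  for (const std::string& line : lines) {
    int64_t index = 0;
    if (!ParseCommitIndex(line, &index)) continue;  // skip REPLICATE and other lines
    if (have_prev && index < prev) inversions.emplace_back(prev, index);
    prev = index;
    have_prev = true;
  }
  return inversions;
}
```

Fed the first six lines of the sample, this would report one inversion, the pair (75941, 75940).
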
> h1. Motivation
> I think the correct order should satisfy the following invariants, where:
> r: replicate
> c: commit
> e[i]: the replicate/commit pair for the op at index i.
>  # r(e[i]) < r(e[i+1]): this is Raft's requirement.
>  # r(e[i]) < c(e[i]): this is obvious.
>  # c(e[i]) < c(e[i+1]): this should hold for the same reason as 1.
> The Raft log is a total order on the server side. The Kudu storage engine is the state machine, so the order in which ops are applied should be the same as the Raft log order.
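
The three ordering rules listed above can be written as a small checker over a sequence of records. This is only a sketch of the proposed (stricter) rule, not of what Kudu enforces today; the `WalRecord` struct, the `RecordType` enum, and `SatisfiesProposedOrder` are hypothetical names introduced for illustration, and records are identified by index alone.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

enum class RecordType { kReplicate, kCommit };

// Hypothetical, simplified view of a WAL record: its type and its op index.
struct WalRecord {
  RecordType type;
  int64_t index;
};

// Returns true if the sequence satisfies the three proposed rules:
//   1. REPLICATE indexes strictly increase,
//   2. every COMMIT appears after the REPLICATE with the same index,
//   3. COMMIT indexes strictly increase.
bool SatisfiesProposedOrder(const std::vector<WalRecord>& records) {
  std::set<int64_t> replicated;  // indexes whose REPLICATE has been seen
  int64_t last_replicate = -1;
  int64_t last_commit = -1;
  for (const WalRecord& r : records) {
    assert(r.index >= 0);
    if (r.type == RecordType::kReplicate) {
      if (r.index <= last_replicate) return false;  // rule 1 violated
      last_replicate = r.index;
      replicated.insert(r.index);
    } else {
      if (replicated.count(r.index) == 0) return false;  // rule 2 violated
      if (r.index <= last_commit) return false;          // rule 3 violated
      last_commit = r.index;
    }
  }
  return true;
}
```

Under these assumptions, the two expected orders from the Background section pass, while the order "replicate 1.1, replicate 1.2, commit 1.2, commit 1.1" is rejected by rule 3.
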
> h1. Solution
> I think we should use an 'apply_pool_token_' with SERIAL_MODE,
> created from apply_pool_, instead of using 'apply_pool_' directly. If we do this, some test cases will need to be fixed at the same time.
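
To illustrate what serial execution would buy here, the following is a minimal stand-in built only on the C++ standard library. It is not Kudu's thread pool or token API: `SerialExecutor` and `ApplyInOrder` are hypothetical names, and a single dedicated worker thread is assumed purely to show that tasks submitted in order run in that order.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serial token: tasks run one at a time,
// in exactly the order in which they were submitted.
class SerialExecutor {
 public:
  SerialExecutor() : worker_([this] { Run(); }) {}

  // Drains any queued tasks, then stops the worker thread.
  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  SerialExecutor(const SerialExecutor&) = delete;
  SerialExecutor& operator=(const SerialExecutor&) = delete;

  void Submit(std::function<void()> task) {
    assert(task);
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (queue_.empty()) return;  // stop requested and nothing left to run
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // run outside the lock so Submit() is never blocked by a task
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;
  std::thread worker_;  // declared last so the members above exist before it starts
};

// Submits n "apply" tasks and returns the order in which they actually ran.
std::vector<int> ApplyInOrder(int n) {
  std::vector<int> applied;
  {
    SerialExecutor executor;
    for (int i = 0; i < n; ++i) {
      executor.Submit([&applied, i] { applied.push_back(i); });
    }
  }  // the destructor drains the queue and joins the worker
  return applied;
}
```

With this sketch, `ApplyInOrder(n)` yields 0, 1, ..., n-1 in submission order. The trade-off is that applies no longer overlap with one another, which is the cost the proposal would have to weigh.
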
>  
> First, we should discuss whether what I described above is correct.
>  


