You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@iotdb.apache.org by GitBox <gi...@apache.org> on 2021/09/13 09:20:33 UTC

[GitHub] [iotdb] chengjianyun opened a new issue #3954: Integrate Apache Ratis to help manage Raft status

chengjianyun opened a new issue #3954:
URL: https://github.com/apache/iotdb/issues/3954


   We have some discussions about the integration of Apache Ratis in RP: https://github.com/apache/iotdb/pull/3939.
   
   Create a new issue here to let more people join.
   
   https://github.com/apache/iotdb/pull/3939#issuecomment-917955815
   
   https://github.com/apache/iotdb/pull/3939#issuecomment-917961701
   
   https://github.com/apache/iotdb/pull/3939#issuecomment-917983405


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920637452


   By the way, I checked the recent Jira issues that are related to the cluster module, only to find very, very few of them are related to the consensus implementation. So I sincerely doubt merely changing to another consensus implementation can solve any problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920845826


   The listed issues concentrate on raft log management, which is already a known issue that needs refactoring, and it only contributes to a small part of the whole raft framework. While there are about 100 issues concerning the usage over raft. It is much more reasonable to think switching to a new framework will cause significantly more issues than fixing the current one.
   
   The tests of raft is merely a small part of the whole system, as the algorithm itself is quite simple and is well explained in the paper. What really matters is how you integrate raft with your own business, which, very unfortunately, cannot be helped no matter what framework you use. So if you think using an existing framework could save a lot of tests, that is not the case.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918736384


   > The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability.
   
   Agree that the case is a bug for IoTDB.
   
   > In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linearizability when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my blog. Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
   
   Sorry that I didn't make the assumption clear. I think we are talking things under IoTDB environment. So I didn't take Read operation into consideration. As you have said, linearization the read is not a regular operation in OLAP system. 
   
   And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library. 
   
   And learned a lot from your blog and references, thanks.
   
   > In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. 
   
   Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the `exactly-only-once` requirement here as the actions are not idempotent for all system. But operations in time-order-data or ``most of actions in IoTDB (in my understanding all of actions right now) are idempotent naturally. And again, all of these are in scope of application instead of Raft. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919788803


   > Linearizability consists of two aspects: consensus and query implementation, we may have spent some efforts on consensus, but little on query implementation, and you should not blame one for another. Meanwhile, queries in IoTDB may be very different from those in systems like etcd, and the concern of query implementation could be also different.
   
   Can't agree more.
   
   > Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superior performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.
   
   The raft libraries often only provides the ability of consensus on Raft log, it doesn't affect business logic much and doesn't force everything is linearization. With proper 3rd party library, we could have enough flexibility to optimize and improvement performance. The production 3rd party libraries should have been verified by some great productions so that we can believe no major bugs happen.
   
   > Also from my personal point of view, I would not hope IoTDB to just become an integration of libraries, without owning its core techniques, which will make us easily overtaken by those who are more willing to challenge.
   > 
   > Above all, most importantly, before you turn to another method, you should answer the following questions:
   > 
   > 1. What is the problem of the current method, do I have any proof?
   
   Updated in description of the issue.
   > 2. Is the new method really better than the current one, how do I show it?
   
   From view of availability, maintainability and code structure, I think yes, the new method would be better than current one. For performance, I'm not sure as it needs experiment. But I think performance lose will be small or no performance lose. Because current implementation of Raft doesn't apply most of optimizations which has been made in other system such as async append, async apply, etc.
   > 3. Is switching to the new method a better choice than improving the current one, is it really that easy?
   
   The safety properties is too agile to guarantee without enough tests for Raft. Building up a systematic test cases is really hard and looks is impossible under currently implementation(Raft and Business logic mix together). Of course, integrate the new library means we may change current structure, the effort can be evaluated after investigation. 
   >    Being doubtful is good, but solid evidence is more convincing.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920883859


   My suggestion is, please give us a list of things your preferred frameworks used to qualify their raft implementation (tests to be specific, and please restrict to raft only). Then we will discuss whether it is easier to add the missing tests or directly using their implementation. I think this is much more convincing than saying "I do not know", "I worry", "I doubt".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918015757


   Reply https://github.com/apache/iotdb/pull/3939#issuecomment-917983405, @LebronAl 
   
   > Production-level raft algorithms generally guarantee linearizability, which is the strictest consistency in distributed systems. Our current Raft algorithm may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time it's performance may be better. What would we say if we migrated raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?
   
   What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case. 
   
   I agree that we can simply some aspects of Raft implementation but which won't affect performance much. The main performance improvement relays on the multi-raft support.
   
   > At the moment we have Raft implementation and business logic mixed together, but there is a certain amount of performance improvement. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously. In fact, at the bottom, they can be executed in parallel. So we implemented parallel asynchronous apply optimization so that plans from the same Raft group but different storage groups can be applied in parallel. The performance gains from this optimization are significant. How should we handle this case after we migrate raft implementation? Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
   
   > "For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously."
   Could you explain this one more detail? 
   
   > Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group?
   
   Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.
   
   Finally, in my opinion, current high through-put mainly gains from multi-raft support. So it won't affect performance too much after integrate Ratis.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] cigarl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

cigarl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918777422


   > But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? 
   
   Agree with you, in the IOT world, data may not be as sensitive,maybe we can sacrifice security _(in some cases)_ to improve performance _(we can even provide configurations to meet different needs of users)_.But for now, this approach seems to pose a greater danger _(eg. cluster can not be restarted successfully, or enter an unstable state, returning confusing results to the user, etc)_ in corner cases.
   
   BTW,maybe we should comb through some optimizations to compare.I'm so sorry that I seem to see only that the strategy of flushing to disk has an effect, the impact at the Raft level doesn't seem to be as large as expected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jixuan1989 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jixuan1989 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920919425

Hi folks,

Very glad to see the deep discussion about this topic.

Actually we had several similar discussions in the last 2 years. Even last year, I did not give up trying to use Ratis [1]. However, we finally give up the solution.

Some guys may read the paper "Apache IoTDB 分布式架构设计" [2]. The paper is written in 2019, and at that time, we used sofa-jraft[3], and we should release the cluster version at that time but finally, we gave up. The experience told us using an existing implementation is not as beautiful as we thought.
In that version, I asked many times "why you design classes and codes like this", the answer is always "I have to because of the underlying implementation".

We finally blame the performance problem and architecture scalability problem on the third-part library. I know the conclusion may be not correct, but it is hard to guarantee whether the history will replay again...

What is more important, Raft and multi-raft are not our destinations. We believe in the time series management scenario, Raft is over-heavy. Using an existing implementation will limit us finally. and I do not want to re-implement the cluster module once again at that time...

I know our implementation may have many bugs and we omit many corner cases now. But it may not be a bad thing. The progress will let us understand the system deeply.

IMO, we really need to:
1. if we can decouple the implementation, it is better. (But it seems there is few things to do according to Tian Jiang's above opinion)
2. If we consider others have been fully tested, we can learn their test cases to make our implementation more robost.

> The answer must be YES, but WHEN? One year or two?

When writing here, I remember another discussion: when we want to implement the monitoring module for IoTDB, which 3rd library should we use? micrometer, dropwizard or others? Some of them have rich features but some of them have better performance. So, how to choose? Finally, we decide to extract a framework and let users choose what they want to use.

So, if we can do some refactor and then allow let users to switch the implementation, I can accept that.

But it is hard to convince me to give up our implementation. In my view, it is a kind of drinking the poison to kill the thirsty.

[1] https://lists.apache.org/thread.html/3c49bc234415417511226c7032009868f9d3475026adf07e2e41f99f%40%3Cdev.ratis.apache.org%3E
[2] http://scis.scichina.com/cn/2020/SSI-2019-0189.pdf
[3] https://github.com/sofastack/sofa-jraft

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918163937

> What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case.

The devil of consensus lies in the corner case. So let me give you three examples that I know so far that may violate linearizability.
1. The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned abov
e. This is the trade-off between performance and safety.
2. In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linear consistency when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my [blog](https://tanxinyu.work/consistency-and-consensus/). Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
3. In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.2 of [Raft's PhD thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf) and dragonboat's discussion on [Zhihu](https://www.zhihu.com/question/278551592).There is no doubt that such an implementation will also affect performance.

>Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on `Ratis`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918800046

To conclude, let's start by investigating Java implementations of some raft libraries, such as Ratis, Sofajraft, etc.
In the process of exploring, I have come up with five aspects that need to be paid attention to, and you can add more:
1. How Raft libraries integrate with the state machine: is there a way like etcd to send committed logs back to the upper level for free processing (best so we can do our own optimizations including consistency model considerations)? Or just abstracting the apply interface and letting us override (which may not be very friendly).
2. Can policies for Raft log persistence be configured: forceSync, batch sync, buffer, etc.
3. Are some common Raft optimizations done: batching/ Pipeline /Async apply, which are very common and effective Raft optimizations.
4. What was done to support multi-raft: was there any pooling done, etc.? This may be of great help in massive raft groups.
5. Correctness: Is there a lot of UT/IT? Are there any TLA+ test/chaos test validation results? Whether there is a large number of production cases.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jixuan1989 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jixuan1989 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920921609


   BTW, I always agree to provide a no-bug or no critical bug version to industrial users. It depends on all of your contributions. 
   I believe we can achieve that if all of us contribute together.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

> You may think it is easier to switch to a new frame than to improve the current one, but it is totally not the case. Implementing a distributed version involves much more than calling some methods in a library. Interface adaption, schema conversion, exception handling, cluster organization, and data distribution... there are so many to do.

The goal for all of us was to make the cluster module more stable: some poeple felt it was better to use a mature Raft library because Raft was hard to be implemented correctly, but I also observed [Kafka](https://github.com/apache/kafka/tree/trunk/raft) writing their own Raft instead of using other Raft libraries. Other people don't feel the need to make big changes right now because they don't seem to have any problems at the moment, but using a great Raft library like etcd frees up the consensus bottleneck of the entire cloud native.

Being pragmatic, we have to admit that even if we decided to use another raft library, it wouldn't have been possible in a month or two. Also, most of the bugs we've fixed so far have nothing to do with consensus. Therefore, the decision to move precipitously requires a great deal of risk.

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Research: Follow my list of 5 concerns to see how other libraries are doing; Understanding these things will help us understand the Raft algorithm more deeply. Whether the Raft library is replaced or not, this is better for cluster IoTDB because the people who developed it know more about Raft.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and need to do some abstraction to change current Raft algorithm. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. I don't think we would be as divergent as we are now.

- Testing: After nearly a year of maintaining cluster modules, I believe that most of the bugs fixed so far have nothing to do with consensus and will show up even replacing raft library. Of course, I'm not saying that the Raft we implemented currently had no bugs, it was probably due to a lack of testing and a lack of production cases. Therefore, I suggest that we can fully test the cluster module from now on, and according to the test results we can make the next step of judgment. In addition, I am currently investigating and designing cluster [chaos-test framework](https://gitlab.summer-ospp.ac.cn/summer2021/210070607). If everything goes well, I will have a chaos testing framework that is easy to deploy at the end of September, and we can also use this framework to test cluster‘s stability. Welcome to join me

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] neuyilan edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

neuyilan edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920542383

I think the current focus is on the following points:

1. Can we implement a good and correct raft framework.

The answer must be **YES**, but **WHEN**? One year or two? With the development of other time-series databases, such as `influxdb_ iox`、`tdengine`、`matrixdb`. Their market share is increasing. As an industrial product, if they want to have their own competitiveness, they must have the characteristics of rapid iteration and stability. However, over the past year, there is no industrial version of cluster version of `IoTDB`. From this perspective, I prefer to build a usable and secure distributed `IoTDB` version first. No matter who uses the raft framework, and I think it is a faster way to integrate one that has been widely used in the industry than to implement it yourself.

2. How good is the performance of the raft framework implemented by yourself?
As far as we know, the performance improvement of our cluster version is as follows:

2.1 async apply；
2.2 The query does not follow the raft, and the consistency of the query can be determined according to the configuration.
But as far as I know, the apply function of `Apache Ratis `framework can be decided by users. Moreover, the read request of Ratis does not follow the raft, so I don't think the performance will degrade much.

3. What is a competitive product?
A safe, stable and widely used product is a competitive product. If there are no available products for users, what about competitiveness?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-921447329


   Hi, all
   
   Thanks for all your opinions and discussions. 
   
   I'd like to close the discussion in a few days. Based on current implementation I don't think we can integrate 3rd party Raft library and the debate here meaningless right now. But no matter we finally decide to apply 3rd party library or not, I think all of us agree that we should decouple the raft and state machine implementation and improve quality at same time. Although this is very hard as have discussed, but I think we can set the aim and achieve it by small step refactors, new design consideration, etc. New discussion should be arise once there is some plan about the decouple.
   
   Any concerns, please let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918139858

> As I see the raft is one consensus algorithm, which may do not have many relations with the linearizability, just put the post[1] for reference.

In academic terms, whether it is Raft, Multi-Paxos or Zab, these consensus algorithms exist only for multi-node consensus, and have nothing to do with consistency or storage systems. I admit that.

In engineering, what we call Raft is often combined with replication state machines and storage systems. In this context, the relationship between Raft and linearizability is very large.

Take the TiKV you quoted in the blog for example. For writing, Raft needs to ensure that the log which received the majority of acks can be committed. For reading, Raft needs to use Read-Index or Lease-Read to ensure linearizability. These designs are actually decisive factors affecting performance, and the key lies in whether we need such safety. However, such core logic is likely to be hard-coded in the Raft library and cannot be changed (if it can be changed, you can ignore me). But in fact, for OLAP scenarios, do we really need such a high level of consistency? For example, [tdengine's](https://www.taosdata.com/cn/documentation/architecture#replication) default synchronization strategy is asynchronous replication rather than majority.

For another example, Zookeeper's ZAB algorithm guarantees sequential consistency, which is a slightly weaker than linearizable consistency but still a strong consistency level (refer to the [jepsen](https://jepsen.io/consistency) official website). Then in theory, it has higher performance than the Raft algorithm that guarantees linearizability.

![image](https://user-images.githubusercontent.com/32640567/133080048-8c5bb062-eacc-4334-804b-99321087f923.png)

I'm not talking about which solution is better. I just think that since we are talking about refactoring, we should think clearly about the guarantee that the entire data model can provide to the outside world. Because they may often determine the upper bound of performance, and in most cases, these are trade-offs between safety and performance. If we don't need some level of safety, then we can definitely go for better performance.

> As far as I know, the apply function is user-defined, we can still implement a parallel apply function according to different storage groups in the one raft log.

In the etcd [example](https://github.com/etcd-io/etcd/blob/main/contrib/raftexample/kvstore.go), we do have the freedom to handle the commit logs, thanks to their excellent abstraction. In ratis's [example](https://github.com/apache/ratis/blob/master/ratis-examples/src/main/java/org/apache/ratis/examples/counter/server/CounterStateMachine.java#L182), it looks like they only exposed one `applyTransaction` interface for us to override, and I doubt whether we can implement our parallel asynchronous apply optimizations. Of course, I've just had a glance. This area needs further investigation. I just hope some of our optimizations don't go away after the migration.

> What is certain is that using a raft library will definitely limit our optimization work compared with the current implementation (mixing the raft framework and business logic), but I think the availability and correctness are far greater than the performance for now.

If performance drops a little bit after migration, I support it. But if it's a big drop, I think it still needs to be considered very carefully. Of course, a preliminary conclusion can be made after further investigation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920883859


   My suggestion is, please give us a list of things your preferred frameworks used to qualify their raft implementation (tests to be specific). Then we will discuss whether it is easier to add the missing tests or directly using their implementation. I think this is much more convincing than saying "I do not know", "I worry", "I doubt".


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918015757


   Reply https://github.com/apache/iotdb/pull/3939#issuecomment-917983405, @LebronAl 
   
   > Production-level raft algorithms generally guarantee linearizability, which is the strictest consistency in distributed systems. Our current Raft algorithm may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time it's performance may be better. What would we say if we migrated raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?
   
   What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case. 
   
   I agree that we can simply some aspects of Raft implementation but which won't affect performance much. The main performance improvement relays on the multi-raft support.
   
   > At the moment we have Raft implementation and business logic mixed together, but there is a certain amount of performance improvement. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously. In fact, at the bottom, they can be executed in parallel. So we implemented parallel asynchronous apply optimization so that plans from the same Raft group but different storage groups can be applied in parallel. The performance gains from this optimization are significant. How should we handle this case after we migrate raft implementation? Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
   
   > "For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously."
   Could you explain this one more detail? 
   
   > Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group?
   
   Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.
   
   Finally, in my opinion, current high throughput mainly gains from multi-raft support. So it won't affect performance too much after integrate Ratis according to my understanding.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919876433


   > No bug for now does not mean no bug forever. And different situations and applications expose different problems. The important thing is building our own verification instead of trusting others just because they have tested on their own cases.
   > 
   > I am more than sure that we will have to face some bugs whatever we use, and if just several bugs would make you lose your heart, I think it is probably not likely you will build any confidence in your own creation, and you may never get near the core of some technologies.
   
   I think the discussion out of the scope, let's narrow down the scope to the topic. IMO, IoTDB is a production, when we deliver the production to user, we need to tell user how good/bad are we. It's nothing to do with technologies or confidence. Many production are based on others implementation such as TiKV depends on RocksDB which opensourced by Facebook. Raft library is only a tool, import it could reduce unnecessary works and provide a relative high confidence module to sync data to different node.
   
   > I am not sure about what you mean by async append, but we have async apply. And we are very serious about performance problems as performance is one of the few advantages of IoTDB and why we would want to build our own consensus. Otherwise, why do you think we would bother so much. We want to continue this advantage instead of handing it out to others.
   
   Agree that the performance is one of the advantages. But even the fastest vehicle should be safe in my understanding. Do you agree this?
   
   `Async append` means leader can append log locally and broadcast log parallelly. 
   
   For `async apply`, not sure my understanding is right or not, but according code, I think it's sync in current IoTDB implemenation. The `appendEntries` acquires lock of `logManager` until get response which means apply complete. And `applyEntries` in RaftLogManager is also a sync operation. Correct me if my understanding is wrong.
   
   ```
   appendEntries in RaftManager:
   private long appendEntries(long prevLogIndex, long prevLogTerm, long leaderCommit, List<Log> logs) {
   // ... ignore lines
       synchronized (logManager) {
         long startTime = Timer.Statistic.RAFT_RECEIVER_APPEND_ENTRY.getOperationStartTime();
         resp = logManager.maybeAppend(prevLogIndex, prevLogTerm, leaderCommit, logs);
         Timer.Statistic.RAFT_RECEIVER_APPEND_ENTRY.calOperationCostTimeFromStart(startTime);
         if (resp != -1) {
           if (logger.isDebugEnabled()) {
             logger.debug("{} append a new log list {}, commit to {}", name, logs, leaderCommit);
           }
           resp = Response.RESPONSE_AGREE;
         } else {
           // the incoming log points to an illegal position, reject it
           resp = Response.RESPONSE_LOG_MISMATCH;
         }
       }
       return resp;
     }
   
   // applyEntries in RaftLogManager
     void applyEntries(List<Log> entries) {
       for (Log entry : entries) {
         applyEntry(entry);
       }
     }
   
     public void applyEntry(Log entry) {
       // For add/remove logs in data groups, this log will be applied immediately when it is
       // appended to the raft log.
       // In this case, it will apply a log that has been applied.
       if (entry.isApplied()) {
         return;
       }
       if (blockAppliedCommitIndex > 0 && entry.getCurrLogIndex() > blockAppliedCommitIndex) {
         blockedUnappliedLogList.add(entry);
         return;
       }
       try {
         logApplier.apply(entry);
       } catch (Exception e) {
         entry.setException(e);
         entry.setApplied(true);
       }
     }
   ```
   
   > I do not understand why you insist that Raft logic is mixed with business logic, as far as I know, only the implementations of partition table and applier concern how stand-alone IoTDB works, and they are already made interface. Most parts of the cluster module can work without any interference with the standalone part, and can also be tested as any raft implementation.
   
   I think we could have a much clear interface definitions and responsibility boundary if we can refine the abstract of Raft implementation. E.g. `DataGroupMember` and `MetaGroupMember` are extend from `RaftMember`, but `RaftMember` mainly handles Raft related logic such as appendEntry. But `DataGroupMember` and `MetaGroupMember` are handlers of application logic such as query or non-query plans.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] cigarl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

cigarl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918777422


   > But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? 
   
   Agree with you, in the IOT world, data may not be as sensitive,maybe we can sacrifice security _(in some cases)_ to improve performance _(we can even provide configurations to meet different needs of users)_.But for now, this approach seems to pose a greater danger _(eg. cluster can not be restarted successfully, or enter an unstable state, returning confusing results to the user, etc)_ in corner cases.
   
   BTW,maybe we should comb through some optimizations to compare.I'm so sorry that I seem to see only that the strategy of flushing to disk has an effect, the impact at the Raft level doesn't seem to be as large as expected.
   
   We have a lot of questions about `Ratis`, let me spend some time on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920630744


   You may think it is easier to switch to a new frame than to improve the current one, but it is totally not the case. Implementing a distributed version involves much more than calling some methods in a library. Interface adaption, schema conversion, exception handling, cluster organization, and data distribution... there are so many to do. On the other hand, the problems you face now are just worries, doubts, and lack of tests, instead of major issues that will take months to fix.
   
   The paralleled log execution, controllable log replication dispatching, batching, and some ongoing improvements. You may have missed some of them. And since many of us come from a university lab, we will be able to put more state-of-the-art improvements into it.
   
   SQL Server and Oracle are very safe and stable, though they are nothing about competitiveness in this field. One may speed up to take some share at the early stage, but without its core, it will be soon driven out of the market.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to change current Raft implementation. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. At that point, I don't think we would be as divergent as we are now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] neuyilan edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

neuyilan edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920542383

I think the current focus is on the following points:

1. Can we implement a good and correct raft framework.

2. How good is the performance of the raft framework implemented by yourself?
As far as we know, the performance improvement of our cluster version is as follows:

1. async apply；
2. The query does not follow the raft, and the consistency of the query can be determined according to the configuration.
But as far as I know, the apply function of `Apache Ratis `framework can be decided by users. Moreover, the read request of Ratis does not follow the raft, so I don't think the performance will degrade much.

3. What is a competitive product?
A safe, stable and widely used product is a competitive product. If there are no available products for users, what about competitiveness?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] neuyilan commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

neuyilan commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920542383

I think the current focus is on the following two points:

1. Can we implement a good and correct raft framework.

2. How good is the performance of the raft framework implemented by yourself?
As far as we know, the performance improvement of our cluster version is as follows:

3. What is a competitive product?
A safe, stable and widely used product is a competitive product. If there are no available products for users, what about competitiveness?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918163937

The devil of consensus lies in the corner case. So let me give you three examples that I know so far that may violate linearizability.
1. The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted, and node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is t
he trade-off between performance and safety.
2. In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linear consistency when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my [blog](https://tanxinyu.work/consistency-and-consensus/). Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
3. In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.2 of [Raft's PhD thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf) and dragonboat's discussion on [Zhihu](https://www.zhihu.com/question/278551592).There is no doubt that such an implementation will also affect performance.

>Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on `Ratis`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918163937

The devil of consensus lies in the corner case. So let me give you three examples that I know so far that may violate linearizability.
1. The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned abov
e. This is the trade-off between performance and safety.
2. In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linearizability when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my [blog](https://tanxinyu.work/consistency-and-consensus/). Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
3. In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.2 of [Raft's PhD thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf) and dragonboat's discussion on [Zhihu](https://www.zhihu.com/question/278551592).There is no doubt that such an implementation will also affect performance.

>Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on `Ratis`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918163937

The devil of consensus lies in the corner case. So let me give you three examples that I know of so far that violate linearizability.
1. The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted, and node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is t
he trade-off between performance and safety.
2. In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linear consistency when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my [blog](https://tanxinyu.work/consistency-and-consensus/). Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
3. In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.2 of [Raft's PhD thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf) and dragonboat's discussion on [Zhihu](https://www.zhihu.com/question/278551592).There is no doubt that such an implementation will also affect performance.

>Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on `Ratis`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920815354

Summary different opinions and reorg the content to make things clear.

## Current situation of cluster availability

- https://github.com/apache/iotdb/pull/3939 The bug is found by reviewing code.
- https://github.com/apache/iotdb/pull/3930 The bug is found by error msg in a restart fail case. Actually the error msg has been ignored for a long time because server can start normally in most cases.
- https://github.com/apache/iotdb/pull/3848 This is a defensive change after reviewing code.
- https://github.com/apache/iotdb/pull/3832 The bug is found in a sever can't start case.

The bugs are find/fixed by us in recent month. Some of the bugs is found by reviewing code and look like obvious and the other are reported in Raft unrelated issue. I have reasons to believe that there are much more bugs we don't know. Some of the bug may cause data lose, as IoTDB is most used in big data scenario, it's hard to feel when a small part of data lost. That's the worst situation, we don't know when that will happen and what result it will cause.

## Goal of discussion

Find a way to improve availability of cluster of IoTDB and keep outstanding performance in a short time(2 months around).

## Assumption

The assumptions are the base of the discussion. We don't need to discussion on these any more.

1. Chosen 3rd party library of Raft has been widely used in industry, has high quality and the algorithm is correct in known cases. As has been widely used, I think we could believe `known cases` are `all cases` we could ever have.
2. An application scenario may cause many scenarios in Raft.

## How to achieve the goal

### Import 3rd party Raft framework
#### Plan
1. Abstract proper interfaces for Raft, make it a module gradually .
2. At same time, we can investigate some 3rd party libraries and find if we can have a proper one.
3. Import the chosen library to replace our raft module if needed.

Of course, things are not as easy as said but it's clear. After this, the tests can be focused on application scenarios which are much less than raft scenarios.

#### Advantages
- Doable
- Raft correctness is guaranteed
- Simplify tests, no UT for Raft

#### Disadvantages
- Maybe lose some performance
- Lost control of Raft, update need a relative long term
- Can't customize raft to improve performance

### Improvement base on current implementation
#### Plan
??

#### Advantages
- Could customize raft to optimize performance.
- Raft update could be done in a short term.

#### Disadvantages
- Correctness of Raft is hard to guarantee.
- It's hard to make raft tests complete.

## My thought

I don't support the solution that improve current implementation because:
- I don't know the current tests for raft are complete or not
- I don't know how far we are from to make raft test cases complete

As TiKV implemented all tests of raft in etcd(after it has been stabilized for years) with RUST, if we could do something similar in a short time, I think its OK improve base on current implementation by ourselves. And optimize raft at this stage is not a good idea, please let's make it correct first.

Supplement if I miss anything!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919646300

Linearizability consists of two aspects: consensus and query implementation, we may have spent some efforts on consensus, but little on query implementation, and you should not blame one for another. Meanwhile, queries in IoTDB may be very different from those in systems like etcd, and the concern of query implementation could be also different.

Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superior performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.

Also from my personal point of view, I would not hope IoTDB to just become an integration of libraries, without owning its core techniques, which will make us easily overtaken by those who are more willing to challenge.

Above all, most importantly, before you turn to another method, you should answer the following questions:
1. What is the problem of the current method, do I have any proof?
2. Is the new method really better than the current one, how do I show it?
3. Is switching to the new method a better choice than improving the current one, is it really that easy?
Being doubtful is good, but solid evidence is more convincing.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918800046






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918793656


   > IMO, this depends on what kinds of properties IoTDB'd like to guarantee for user. If IoTDB says, I won't lost any data in any situation, then persist each log every time is required and relative performance lose should be acceptable. But if it's doesn't do such guarantee, then it's OK to do some optimization. Of course, we can make it configurable just like standalone WAL in IoTDB. As I didn't look into Ratis, I'm not sure if it makes this configurable.
   
   Seems can be [configurable](https://github.com/apache/ratis/blob/7a852736de1588bc645f86a92a1d6b4d37b21e2b/ratis-server-api/src/main/java/org/apache/ratis/server/RaftServerConfigKeys.java#L268). Anyway, Let's speed some time exploring  it.
   
   >I don't understand, on zhihu, it says "proposal id" is generated by client, and filter when do apply(often implemented by State Machine). So I think the retry logic is transparent for raft, the retry operation is new log for raft. Not sure my understanding is right or not.
   
   maybe you could read my [doc](https://github.com/LebronAl/MIT6.824-2021/blob/master/docs/lab3.md) of MIT 6.824 lab3, which is slightly more understandable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918760291


   > Agree that the case is a bug for IoTDB.
   
   Yes
   
   > Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.
   
   But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? The current implementation is a performance-oriented compromise for us, but it certainly sacrifices security. Strictly speaking, we currently buffer data through user-space buffers, whether standalone wal or raft logs. However, with [Ratis](https://github.com/apache/ratis/blob/7a852736de1588bc645f86a92a1d6b4d37b21e2b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L510), they seem to be caching the data in the operating system's Page cache, and `async` is determined based on the parameters and load. I think our approach’s performance will be better (because it sacrifices more security than theirs), but I don't know how much of performance difference there is. This is worth exploring.
   
   > And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.
   
   Hope so. IMO, the more additional code we have to write using these Raft libraries, the more optimization possibilities we have for our scenario, just like with etcd. But with Ratis, I'm not sure yet.
   
   > Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system.
   
   In fact, even if every command is idempotent, as long as the client adds retry logic, it will violate linearizability without filtering as described above... This is a fairly advanced feature, as far as I know, only the Raft library for Dragonboat supports this logic, and I don't think Ratis has done it yet. So we can forget about this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919809646

No bug for now does not mean no bug forever. And different situations and applications expose different problems. The important thing is building our own verification instead of trusting others just because they have tested on their own cases.

I am more than sure that we will have to face some bugs whatever we use, and if just several bugs would make you lose your heart, I think it is probably not likely you will build any confidence in your own creation, and you may never get near the core of some technologies.

I am not sure about what you mean by async append, but we have async apply. And we are very serious about performance problems as performance is one of the few advantages of IoTDB and why we would want to build our own consensus. Otherwise, why do you think we would bother so much. We want to continue this advantage instead of handing it out to others.

I do not understand why you insist that Raft logic is mixed with business logic, as far as I know, only the implementations of partition table and applier concern how stand-alone IoTDB works, and they are already made interface. Most parts of the cluster module can work without any interference with the standalone part, and can also be tested as any raft implementation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920845826


   The listed issues concentrate on raft log management, which is already a known issue that needs refactoring, and it only contributes to a small part of the whole raft framework. While there are about 100 issues concerning the usage over raft. It is much more reasonable to think switching to a new framework will cause significantly more issues than fixing the current one.
   
   The tests of raft is merely a small part of the whole system, as the algorithm itself is quite simple and is well explained in the paper. What really matters is how you integrate raft with your own business, which, unfortunately, cannot be helped no matter what framework you use. So if you think using an existing framework could save a lot of tests, that is not the case. However, what is sure is that you will have to follow someone else's design pattern to redesign your interfaces and make adaptions, accepting their system and application assumption, which may not be easy and pleasent.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920523977

Most reliable safety is built on the top of deep knowledge of the techniques you use, not how others say it is safe. Sure a lib it is, but what are "unnecessary works"? Our initial purpose was to create a fully controllable cluster module that will not become a bottleneck in the future. Consensus may be the most important thing above all, and it deserves our primary efforts. It is not JUST a tool, it could be the foundation of the whole cluster module.

Of course safety matters, but how can you assert we can not make it safe? Are we really less capable than those who have written those libs, especially we can integrate the experiences of all of them as they are all open-source. I think this is what open-source is all about: we learn from others instead of we copy from others. After all, open-source or not, you can still use compiled libs from others, while being open-source, you learn from them what are the good practices and bad practices so that you can improve your own.

The sync is used because the client wants a result so we must wait until the end of execution, if the result is not desired, the block can be easily removed to achieve a fully synchronized method.

I am not sure what you mean by a more clear interface. `RaftMember` provides basic Raft implementation, based on it, `MetaGroupMember` forms a meta group to manage metadata, `DataGroupMember` forms data groups to manage data. Raft logic and business logic are obviously separated, and I think it is clear enough. Please specify your problems if any.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920815354

Summary different opinions and reorg the content to make things clear.

## Current situation of cluster availability

## Goal of discussion

Find a way to improve availability of cluster of IoTDB and keep outstanding performance in a short time(2 months around in my opinion).

## Assumption

The assumptions are the base of the discussion. We don't need to discussion on these any more.

## How to achieve the goal

Of course, things are not as easy as said but it's clear. After this, the tests can be focused on application scenarios which are much less than raft scenarios.

#### Advantages
- Doable
- Raft correctness is guaranteed
- Simplify tests, no UT for Raft

#### Disadvantages
- Maybe lose some performance
- Lost control of Raft, update need a relative long term
- Can't customize raft to improve performance

### Improvement base on current implementation
#### Plan
??

#### Advantages
- Could customize raft to optimize performance.
- Raft update could be done in a short term.

#### Disadvantages
- Correctness of Raft is hard to guarantee.
- It's hard to make raft tests complete.

## My thought

Supplement if I miss anything!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to change current Raft implementation. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. I don't think we would be as divergent as we are now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to change current Raft algorithm. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. I don't think we would be as divergent as we are now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-921407741


   There is also another more practical problem. Both Raft and IoTDB have their own log management, and we aim to combine them or replace one with another. I think state machine safety is more important than consensus safety since breaking consensus requires more than one node to malfunction while a single node failure may break state machine safety, so IoTDB WAL will probably be the core of the final log.
   
   However, the log management is enclosed in other frameworks as it normally does not interfere with the business logic, so it will be extremely hard to achieve the purpose of unifying logs with one of the implementation is hidden behind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918015757


   Reply https://github.com/apache/iotdb/pull/3939#issuecomment-917983405, @LebronAl 
   
   > Production-level raft algorithms generally guarantee linearizability, which is the strictest consistency in distributed systems. Our current Raft algorithm may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time it's performance may be better. What would we say if we migrated raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?
   
   What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case. 
   
   I agree that we can simply some aspects of Raft implementation but which won't affect performance much. The main performance improvement relays on the multi-raft support.
   
   > At the moment we have Raft implementation and business logic mixed together, but there is a certain amount of performance improvement. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously. In fact, at the bottom, they can be executed in parallel. So we implemented parallel asynchronous apply optimization so that plans from the same Raft group but different storage groups can be applied in parallel. The performance gains from this optimization are significant. How should we handle this case after we migrate raft implementation? Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
   
   > "For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously."
   Could you explain this one more detail? 
   
   > Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group?
   
   Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.
   
   Finally, in my opinion, current performance main gains from multi-raft support. So it won't affect performance too much after integrate Ratis.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] jt2594838 commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

jt2594838 commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919646300

Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superiour performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-919788803


   > Linearizability consists of two aspects: consensus and query implementation, we may have spent some efforts on consensus, but little on query implementation, and you should not blame one for another. Meanwhile, queries in IoTDB may be very different from those in systems like etcd, and the concern of query implementation could be also different.
   
   Can't agree more.
   
   > Relying on another library for core functionalities is risky. Other projects keep bug fixing and refactoring too, it is hard to say if we can catch up with them after some time point. And it would become very difficult to implement any improvements on it, which may eventually make IoTDB lack of competitiveness, as the most important advantage of current IoTDB lies on its superior performance, we would completely lose control if the consensus part becomes a bottleneck. Ans should any major bug occur, we will have to wait another team to fix it, because fixing it ourselves will create forks.
   
   The raft libraries often only provides the ability of consensus on Raft log, it doesn't affect business logic much and doesn't force everything is linearization. With proper 3rd party library, we could have enough flexibility to optimize and improvement performance. The production 3rd party libraries should have been verified by some great productions so that we can believe no major bugs happen.
   
   > Also from my personal point of view, I would not hope IoTDB to just become an integration of libraries, without owning its core techniques, which will make us easily overtaken by those who are more willing to challenge.
   > 
   > Above all, most importantly, before you turn to another method, you should answer the following questions:
   > 
   > 1. What is the problem of the current method, do I have any proof?
   Updated in description of the issue.
   > 2. Is the new method really better than the current one, how do I show it?
   From view of availability, maintainability and code structure, I think yes, the new method would be better than current one. For performance, I'm not sure as it needs experiment. But I think performance lose will be small or no performance lose. Because current implementation of Raft doesn't apply most of optimizations which has been made in other system such as async append, async apply, etc.
   > 3. Is switching to the new method a better choice than improving the current one, is it really that easy?
   The safety properties is too agile to guarantee without enough tests for Raft. Building up a systematic test cases is really hard and looks is impossible under currently implementation(Raft and Business logic mix together). Of course, integrate the new library means we may change current structure, the effort can be evaluated after investigation. 
   >    Being doubtful is good, but solid evidence is more convincing.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918163937

The devil of consensus lies in the corner case. So let me give you three examples that I know so far that may violate linearizability.
1. The uncommitted Raft logs are not persisted. Consider a scenario where node A is the leader, nodes B and C are followers, node A synchronizes the log to B and C and gets all acks, then submits and applies the log, and then returns success to the client before the next heartbeat to followers. Then there was a momentary power outage, nodes B and C restarted immediately, but node A restarted slowly. Node B and C's log are empty after restarting and recovering, but they are already a majority, so they can serve the client's read request, then this may violate linearizability. Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned abov
e. This is the trade-off between performance and safety.
2. In fact, to ensure linearizability Raft uses read-index or lease-read. In our current implementation, direct-read is used for the leader and read-index is used for the follower. This can violate linearizability when a node outage causes a replacement node to execute the same read request. For more specific examples, you can refer to my [blog](https://tanxinyu.work/consistency-and-consensus/). Of course, we could also use read-index on the leader, but this would undoubtedly degrade performance.
3. In fact, Raft's naive implementation guarantees at-least-once semantics, and to ensure linearizability semantics, uuid is generated on the client side and a map is logged on the server side to ensure that each command is executed only once. You can refer to section 6.3 of [Raft's PhD thesis](https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf) and dragonboat's discussion on [Zhihu](https://www.zhihu.com/question/278551592).There is no doubt that such an implementation will also affect performance.

>Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.

I hope so~Maybe we need to make a detailed investigation on `Ratis`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918800046


   To conclude, let's start by investigating Java implementations of some raft libraries, such as Ratis, Sofajraft, etc.
   In the process of exploring, I have come up with four aspects that need to be paid attention to, and you can add more:
   1. How Raft libraries integrate with the state machine: is there a way like etcd to send committed logs back to the upper level for free processing (best so we can do our own optimizations including consistency model considerations)? Or just abstracting the apply interface and letting us override (which may not be very friendly).
   2. Can policies for Raft log persistence be configured: forceSync, batch sync, etc.
   3. Are some common Raft optimizations done: batching/ Pipeline /Async apply, which are very common and effective Raft optimizations.
   4. What was done to support multi-raft: was there any pooling done, etc.? This may be of great help in massive raft groups.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918776584


   > > Agree that the case is a bug for IoTDB.
   > 
   > Yes
   > 
   > > Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.
   > 
   > But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? The current implementation is a performance-oriented compromise for us, but it certainly sacrifices security. Strictly speaking, we currently buffer data through user-space buffers, whether standalone wal or raft logs. However, with [Ratis](https://github.com/apache/ratis/blob/7a852736de1588bc645f86a92a1d6b4d37b21e2b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L510), they seem to be caching the data in the operating system's Page cache, and `async` is determined based on the parameters and load. I think our approach’s performance will be better (because it sacrifices more security than theirs), but I don't know how much of performance difference there is. This is worth exploring.
   
   IMO, this depends on what kinds of properties IoTDB'd like to guarantee for user. If IoTDB says, I won't lost any data in any situation, then persist each log every time is required and relative performance lose should be acceptable. But if it's doesn't do such guarantee, then it's OK to do some optimization. Of course, we can make it configurable just like standalone WAL in IoTDB. As I didn't look into Ratis, I'm not sure if it makes this configurable.
   
   > > And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.
   > 
   > Hope so. IMO, the more additional code we have to write using these Raft libraries, the more optimization possibilities we have for our scenario, just like with etcd. But with Ratis, I'm not sure yet.
   > 
   > > Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system.
   > 
   > In fact, even if every command is idempotent, as long as the client adds retry logic, it will violate linearizability without filtering as described above... This is a fairly advanced feature, as far as I know, only the Raft library for Dragonboat supports this logic, and I don't think Ratis has done it yet. So we can forget about this.
   
   I don't understand, on [zhihu](https://www.zhihu.com/question/278551592), it says "proposal id" is generated by client, and filter when do apply(often implemented by State Machine). So I think the retry logic is transparent for raft, the retry operation is new log for raft. Not sure my understanding is right or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920506683


    > And applyEntries in RaftLogManager is also a sync operation. Correct me if my understanding is wrong.
   
   You will see an `AsyncDataLogApplier` if you go further. And we did a lot more rather than just `async apply`. And I call it `Parallel async apply` before, which means data from different storage groups can be applied in parallel. You can check the code again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] chengjianyun edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

chengjianyun edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918015757


   Reply https://github.com/apache/iotdb/pull/3939#issuecomment-917983405, @LebronAl 
   
   > Production-level raft algorithms generally guarantee linearizability, which is the strictest consistency in distributed systems. Our current Raft algorithm may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time it's performance may be better. What would we say if we migrated raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?
   
   What do you mean by "strictly speaking it does not guarantee linearizability"? In which case, IoTDB doesn't guarantee the linearizability? In my opinion, it doesn't matter IoTDB or other products, if you are using Raft, that means you need the linearizability which is what raft provided. If IoTDB doesn't guarantee the linearizability, I think it's a bug. Right now, I didn't find any of the case. 
   
   I agree that we can simply some aspects of Raft implementation but which won't affect performance much. The main performance improvement relays on the multi-raft support.
   
   > At the moment we have Raft implementation and business logic mixed together, but there is a certain amount of performance improvement. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously. In fact, at the bottom, they can be executed in parallel. So we implemented parallel asynchronous apply optimization so that plans from the same Raft group but different storage groups can be applied in parallel. The performance gains from this optimization are significant. How should we handle this case after we migrate raft implementation? Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
   
   > "For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously."
   Could you explain this one more detail? 
   
   > Add parallel apply feature to raft library Or do we change our partitioning pattern so that a Raft group serves only one storage group?
   
   Application(Raft user) takes care of log apply, so it doesn't matter how Raft implemented.
   
   Finally, in my opinion, current high throughput mainly gains from multi-raft support. So it won't affect performance too much after integrate Ratis.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] neuyilan commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

neuyilan commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918019389


   > is linearizability really necessary for OLAP databases like time-series databases?
   
   As I see the raft is one consensus algorithm, which may do not have many relations with the linearizability, just put the post[1] for reference.
   > we currently put data from multiple storage groups into one Raft group to be executed synchronously.
   
   As far as I know, the apply function is user-defined, we can still implement a parallel apply function according to different storage groups in the one raft log.
   
   What is certain is that using a raft library will definitely limit our optimization work compared with the current implementation (mixing the raft framework and business logic), but I think the availability and correctness are far greater than the performance for now.
   [1] https://zhuanlan.zhihu.com/p/47117804


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl commented on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl commented on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-918760291


   > Agree that the case is a bug for IoTDB.
   
   Yes
   
   > Of course, this is strictly an implementation bug and can be fixed. But even if the bug is fixed, you can see that our raft log's serialization buffer refers to the implementation of the stand-alone WAL, it does not write to the disk and call async every time a log is written, which ensures that the performance will not be limited by the IOPS of the disk. But this may also cause the corner case mentioned above. This is the trade-off between performance and safety.
   
   But the key sentence is this. Even if the bug is fixed. Should we also call async every time when we flush log? The current implementation is a performance-oriented compromise for us, but it certainly sacrifices security. Strictly speaking, we currently buffer data through user-space buffers, whether standalone wal or raft logs. However, with [Ratis](https://github.com/apache/ratis/blob/7a852736de1588bc645f86a92a1d6b4d37b21e2b/ratis-server/src/main/java/org/apache/ratis/server/raftlog/segmented/SegmentedRaftLogWorker.java#L510), they seem to be caching the data in the operating system's Page cache, and `async` is determined based on the parameters and load. I think our approach’s performance will be better (because it sacrifices more security than theirs), but I don't know how much of performance difference there is. This is worth exploring.
   
   > And let's go back to Raft library topic. Import the Raft library doesn't mean application needs to sync all operation via the library, it depends on requirement. I don't think IoTDB needs to change current read action process flow after import the library.
   
   Hope so. IMO, the more additional code we have to write using these Raft libraries, the more optimization possibilities we have for our scenario, just like with etcd. But with Ratis, I'm not sure yet.
   
   > Agree the concept but which is unnecessary under IoTDB or time-order data environment. I think the exactly-only-once requirement here as the actions are not idempotent for all system.
   
   In fact, even if every command is idempotent, as long as the client adds retry logic, it will violate linearizability without filtering as described above... This is a fairly advanced feature, as far as I know, only the Raft library for Dragonboat supports this logic, and I don't think Ratis has done it yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

The goal for all of us was to make the cluster module more stable: some poeple felt it was better to use a mature Raft library because Raft was hard to be implemented correctly, but I also observed [Kafka](https://github.com/apache/kafka/tree/trunk/raft) writing their own Raft instead of using other Raft libraries. Other people don't feel the need to make big changes right now because they don't seem to have many problems at the moment, but using a great Raft library like etcd frees up the consensus bottleneck of the entire cloud native.

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to replace current Raft implementation. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. At that point, I don't think we would be as divergent as we are now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [iotdb] LebronAl edited a comment on issue #3954: Integrate Apache Ratis to help manage Raft status

Posted by GitBox <gi...@apache.org>.

LebronAl edited a comment on issue #3954:
URL: https://github.com/apache/iotdb/issues/3954#issuecomment-920767517

The goal for all of us was to make the cluster module more stable: some poeple felt it was better to use a mature Raft library because Raft was hard to be implemented correctly, but I also observed [Kafka](https://github.com/apache/kafka/tree/trunk/raft) writing their own Raft instead of using other Raft libraries. Other people don't feel the need to make big changes right now because they don't seem to have many problems at the moment, but using a great Raft library like etcd frees up the consensus bottleneck of the entire cloud native.

So I suggest we go in three directions in parallel: research + refactoring + testing.

- Refactoring: Since we've recently started refactoring cluster code, I thought we could refactor Raft code as well. Ideally, it should be a single module, like [Kafka](https://github.com/apache/kafka/tree/trunk/raft). For anyone who wants to change raft library, this is basically the process of changing library, and we must to do some abstraction in order to change current Raft implementation. For those who don't want to change raft library, doing so can improve code readability and make it easier to add more complex tests. After modularization, we can enumerate some performance comparisons and pros and cons before we discuss whether raft libraries need to be replaced. At that point, I don't think we would be as divergent as we are now.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@iotdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org