Posted to jira@kafka.apache.org by "clolov (via GitHub)" <gi...@apache.org> on 2023/05/25 12:24:29 UTC

[GitHub] [kafka] clolov commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception

clolov commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1562816088

   Heya @hudeqi, could you give a more detailed explanation of the problem you are trying to solve here? I do not fully understand it. May I suggest you distinguish between a partition replica and a partition future replica in some way? Otherwise it is quite difficult to tell which replica you are referring to when it is just called "partition".
   
   Let's say we have an original replica in log directory A (backed by one disk) and a future replica in log directory B (backed by another disk) on the same broker. I have confirmed that **compaction** is paused on A when B is first created (https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/ReplicaManager.scala#L910). It is only resumed here: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/log/LogManager.scala#L1099. I can think of two situations happening now.
   
   1. A fails and B doesn't know what to do.
   In this situation, if A has failed then cleaning should not be resumed on A - operator intervention is required to understand what went wrong with the log directory. Cleaning should not be started on B either. Since A has failed, the amount of data in B does not grow because B no longer has a source to copy from.
   
   2. B fails and A doesn't know what to do.
   In this situation B's size cannot grow because it is no longer copying from A. A should resume cleaning because, for all intents and purposes, it acts as a normal replica of the topic partition.
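
To make the two situations concrete, here is a rough sketch of how I read them. It is a toy model only and not Kafka's actual code: `FutureReplicaFailureSketch`, `Failure`, `Outcome` and `outcomeFor` are all hypothetical names I made up for illustration.

```java
/** Hypothetical model of the two failure situations; none of this is Kafka's real API. */
final class FutureReplicaFailureSketch {

    /** Which log directory failed while future replica B was copying from original replica A. */
    enum Failure { CURRENT_DIR_A, FUTURE_DIR_B }

    /** What I would expect to hold after the failure (illustrative fields only). */
    static final class Outcome {
        final boolean resumeCleaningOnA;   // should compaction be resumed on A?
        final boolean startCleaningOnB;    // should compaction be started on B?
        final boolean futureStillCopying;  // is B still copying from A, i.e. can B keep growing?

        Outcome(boolean resumeCleaningOnA, boolean startCleaningOnB, boolean futureStillCopying) {
            this.resumeCleaningOnA = resumeCleaningOnA;
            this.startCleaningOnB = startCleaningOnB;
            this.futureStillCopying = futureStillCopying;
        }
    }

    static Outcome outcomeFor(Failure failure) {
        switch (failure) {
            case CURRENT_DIR_A:
                // Situation 1: A failed. No cleaning on A (an operator has to look at it),
                // no cleaning on B, and B has no source left to copy from.
                return new Outcome(false, false, false);
            case FUTURE_DIR_B:
                // Situation 2: B failed. B stops copying, and A goes back to behaving
                // like a normal replica, so cleaning on A should resume.
                return new Outcome(true, false, false);
            default:
                throw new IllegalArgumentException("unknown failure: " + failure);
        }
    }
}
```

In this model `futureStillCopying` is false in both situations, which is why I do not see how B's disk usage keeps growing after either failure; the only difference between the two is whether cleaning should be resumed on A.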
   
   As far as I understand, what you are trying to do is solve situation number 2 - am I correct? If so, what is the reasoning behind marking A as a failed partition?

