You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2022/09/22 08:28:04 UTC

[GitHub] [kafka] dajac opened a new pull request, #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

dajac opened a new pull request, #12674:
URL: https://github.com/apache/kafka/pull/12674

   There are clients out there that have implemented KIP-392 (Fetch From Follower) and thus use FetchRequest >= 11. However, they have not implemented KIP-320 which add the leader epoch to the FetchRequest in version 9. Without KIP-320, it is not safe to fetch from the follower. If a client does it by mistake – e.g. based on stale metadata – that could lead to offset out of range.
   
   This patch proposes to disable fetching from a follower when the cluster does not have a replica selector. If it does not, consumers are not expected to fetch from followers.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] hachikuji commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
hachikuji commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1264069768

   @dajac Yeah, this is a tough one. I think one perspective we could take is that consumers which have not implemented the protocol should not be supported. So perhaps we should only check whether leader epoch is provided and reject the request otherwise. After all , even if follower fetching is intended, the user might still encounter spurious `OFFSET_OUT_OF_RANGE` errors if the client has not implemented the protocol. Alternatively, maybe we should not return that error in the first place for clients which are not providing the leader epoch? We could return `OFFSET_NOT_AVAILABLE` instead if the leader epoch is -1. Would that work?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] showuon commented on a diff in pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
showuon commented on code in PR #12674:
URL: https://github.com/apache/kafka/pull/12674#discussion_r978329491


##########
core/src/test/scala/unit/kafka/server/FetchRequestTest.scala:
##########
@@ -285,8 +315,8 @@ class FetchRequestTest extends BaseFetchRequestTest {
 
     // Check follower error codes
     val followerId = TestUtils.findFollowerId(topicPartition, servers)
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.empty())
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.of(secondLeaderEpoch))
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.empty())
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.of(secondLeaderEpoch))

Review Comment:
   Sounds good to me! Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273387524

   > I think one perspective we could take is that consumers which have not implemented the protocol should not be supported. So perhaps we should only check whether leader epoch is provided and reject the request otherwise. After all, even if follower fetching is intended, the user might still encounter spurious `OFFSET_OUT_OF_RANGE` errors if the client has not implemented the protocol.
   
   That would be ideal but that seems quite brutal for existing deployment. There are definitely a few users using those clients with follower fetching. I wonder if we should consider something like bumping the fetch request version and do this as well in order to strengthen the protocol. That would at least give a chance to those clients to rectify their implementation if they want to use newer fetch versions in the future. I am not sure that it is worth it though.
   
   > Alternatively, maybe we should not return that error in the first place from followers if the client is not providing the leader epoch? We could return `OFFSET_NOT_AVAILABLE` instead if the leader epoch is -1. Would that work?
   
   librdkafka does not handle `OFFSET_NOT_AVAILABLE` well, unfortunately. It only retries the fetch request with a small backoff when it receives it but it does not refresh its metadata so the client would likely not rediscover the correct leader.
   
   It seems that there is no good solution to mitigate the issue for those existing clients. We've discussed the following approaches:
   1) Disallow fetch from follower when there is no replica selector. This seems pretty safe but could be an issue when the cluster is rolled to enable the selector. We could make the config dynamic to mitigate this. Clients using follower fetching would still be subject to the issue though.
   2) Disallow fetch from follower when there is no leader epoch and no rack id in the request. This would ensure that existing clients that does not use follower fetching are safe. However, we have no guarantee that it would not break existing deployment because a selector does not necessary use rack.id. Clients using follower fetching would still be subject to the issue though.
   3) Disallow fetch from follower when there is no leader epoch in the request. This would prevent existing clients from working.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ijuma commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
ijuma commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1256201486

   This change makes sense. I guess the other interesting case is if fetch from follower _is_ enabled on the broker, but the client is not sending the epoch (KIP-320). What should we do in that case? Is it safe to let it through? We may need another JIRA to discuss that case, which is not as clear, but still potentially problematic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ijuma commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
ijuma commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1259463709

   We could document that fetch from follower must first be enabled on the cluster before it's enabled on the client _if_ KIP-320 is not supported by the client.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] hachikuji commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
hachikuji commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273790399

   Using `OFFSET_NOT_AVAILABLE` on followers seems to be the right direction I think. It is the only approach we've discussed which addresses the spurious out of range issue for users who want follower fetching on these clients. This error can already be returned after a normal leader change, so clients must handle it. 
   
   Perhaps we could make the behavior more precise instead of blinding returning `OFFSET_NOT_AVAILABLE` for all fetches above the local end offset. The follower doesn't know if the offset is out of range or not since its log is always behind the leader, so we can argue it is incorrect for it to return the error in the first place. But it can find out by querying the leader. In fact, the `Fetch` request already returns the leader's end offset. Could we make use of that? Maybe we could hold out-of-range fetches in purgatory until the follower knows whether or not the leader has the requested offset?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1259370231

   @hachikuji Yeah, that's right. If a client already specifies the rack id before the cluster sets up a replica selector, the client won't be able to fetch until all the brokers have the selector. An alternative to the current proposal would be to check wether the rack id is present or not. We could for instance disallow fetching from a follower if the rack id is null and the leader epoch is -1. That would limit the fix to the set of clients that implemented FFF without KIP-320. Do you think that would be better?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273549678

   Yes, that's the 4th option.
   
   1. Yeah, that's right but librdkafka is not the only one concerned. It is not clear to me how Sarama would handle this for instance. The compatibility story is quite complicated here.
   2. 5 minutes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273698361

   Yeah. The delay may be acceptable in this case. I am actually more concerned by the other clients out there. It is hard to really know if this change would be OK for all of them, knowing that some clients have partially implemented the fetch protocol.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ijuma commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
ijuma commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273482277

   Wasn't the 4th option:
   
   4. Return `OFFSET_NOT_AVAILABLE` instead of `OFFSET_OUT_OF_RANGE` if the leader epoch is not provided?
   
   > librdkafka does not handle OFFSET_NOT_AVAILABLE well, unfortunately. It only retries the fetch request with a small backoff when it receives it but it does not refresh its metadata so the client would likely not rediscover the correct leader.
   
   Two things:
   1. We could change librdkafka to refresh its metadata and backport the fix to older versions.
   2. How long does librdkafka retry for before giving up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1276256629

   Jason and I discussed offline. Returning `OFFSET_NOT_AVAILABLE` is not really appropriate here because clients may not expect it on the fetch path. The issue is that we have never implemented it on the fetch path so it is not clear what the impact of returning it on non-java clients would be. We discussed another alternative that I prototypes here: https://github.com/apache/kafka/pull/12734.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac closed pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac closed pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled
URL: https://github.com/apache/kafka/pull/12674


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] hachikuji commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
hachikuji commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1273652187

   > librdkafka does not handle OFFSET_NOT_AVAILABLE well, unfortunately. It only retries the fetch request with a small backoff when it receives it but it does not refresh its metadata so the client would likely not rediscover the correct leader.
   
   Hmm, perhaps that handling is good enough? In the event that the out of range error was spurious, the follower would soon catch up to the leader. I assume that is the common case unless unclean election is enabled. It does not seem like a big problem if there is a delay finding a genuine out of range case. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on PR #12674:
URL: https://github.com/apache/kafka/pull/12674#issuecomment-1256427128

   > This change makes sense. I guess the other interesting case is if fetch from follower _is_ enabled on the broker, but the client is not sending the epoch (KIP-320). What should we do in that case? Is it safe to let it through? We may need another JIRA to discuss that case, which is not as clear, but still potentially problematic.
   
   I think that one fundamental issue is that a follower is accessible even if it has not caught-up yet with the leader. The follower might be far behind for instance or even have wrong records if the truncation did not happen yet. It seems to me that we should make it inaccessible until the follower has at least got the updated high watermark from the leader. Once we have the HWM, we could perhaps also use OFFSET_NOT_AVAILABLE if the requested offset are not available yet. I will file a jira and put my thoughts in it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on a diff in pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on code in PR #12674:
URL: https://github.com/apache/kafka/pull/12674#discussion_r977354778


##########
core/src/test/scala/unit/kafka/server/FetchRequestTest.scala:
##########
@@ -285,8 +315,8 @@ class FetchRequestTest extends BaseFetchRequestTest {
 
     // Check follower error codes
     val followerId = TestUtils.findFollowerId(topicPartition, servers)
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.empty())
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.of(secondLeaderEpoch))
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.empty())
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.of(secondLeaderEpoch))

Review Comment:
   Note: The epoch is correctly validated here but the request is now rejected because the cluster does not have a replica selector.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dajac commented on a diff in pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
dajac commented on code in PR #12674:
URL: https://github.com/apache/kafka/pull/12674#discussion_r978308464


##########
core/src/test/scala/unit/kafka/server/FetchRequestTest.scala:
##########
@@ -285,8 +315,8 @@ class FetchRequestTest extends BaseFetchRequestTest {
 
     // Check follower error codes
     val followerId = TestUtils.findFollowerId(topicPartition, servers)
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.empty())
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.of(secondLeaderEpoch))
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.empty())
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.of(secondLeaderEpoch))

Review Comment:
   Yeah, I was thinking about this as well. The point is that this is the correct and expected behavior in this case so I am inclined to keep it as it is now. However, it would be great to have a test suite with a replica selector enabled. I can do this as a follow-up.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] showuon commented on a diff in pull request #12674: KAFKA-14255: Fetching from follower should be disallowed if fetch from follower is disabled

Posted by GitBox <gi...@apache.org>.
showuon commented on code in PR #12674:
URL: https://github.com/apache/kafka/pull/12674#discussion_r978213955


##########
core/src/main/scala/kafka/server/KafkaApis.scala:
##########
@@ -949,14 +947,17 @@ class KafkaApis(val requestChannel: RequestChannel,
       val fetchMaxBytes = Math.min(Math.min(fetchRequest.maxBytes, config.fetchMaxBytes), maxQuotaWindowBytes)
       val fetchMinBytes = Math.min(fetchRequest.minBytes, fetchMaxBytes)
 
-      val clientMetadata: Option[ClientMetadata] = if (versionId >= 11) {
-        // Fetch API version 11 added preferred replica logic
+      val clientMetadata = if (versionId >= 11 && replicaManager.hasReplicaSelector) {
+        // Fetching from follower is only supported from Fetch API version 11. Moreover, we
+        // only allow it if the broker has a replica selector configured. If it does not, there

Review Comment:
   nit: there [is]



##########
core/src/test/scala/unit/kafka/server/FetchRequestTest.scala:
##########
@@ -285,8 +315,8 @@ class FetchRequestTest extends BaseFetchRequestTest {
 
     // Check follower error codes
     val followerId = TestUtils.findFollowerId(topicPartition, servers)
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.empty())
-    assertResponseErrorForEpoch(Errors.NONE, followerId, Optional.of(secondLeaderEpoch))
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.empty())
+    assertResponseErrorForEpoch(Errors.NOT_LEADER_OR_FOLLOWER, followerId, Optional.of(secondLeaderEpoch))

Review Comment:
   I'm afraid this will cause some confusion or possibly hide some potential bug.
   I'm thinking we can override the beforeEach to create a test broker with replica selector config for this test to avoid `NOT_LEADER_OR_FOLLOWER` error. ex:
   
   ```java
   class FetchRequestTest extends BaseFetchRequestTest {
     var shouldCreateReplicaSelector = false
   
     @BeforeEach
     override def setUp(testInfo: TestInfo): Unit = {
       if (testInfo.getDisplayName().equals("testCurrentEpochValidation")) {
         shouldCreateReplicaSelector = true
       }
       super.setUp(testInfo)
     }
   
     override def generateConfigs: Seq[KafkaConfig] = {
       TestUtils.createBrokerConfigs(1, zkConnect).map( p => {
         if (shouldCreateReplicaSelector)
           p.setProperty("replica.selector.class", "mock.class.name")
         KafkaConfig.fromProps(p)
       })
     }
   ```
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org