You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/06/02 12:05:13 UTC

[GitHub] [kafka] tombentley opened a new pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

tombentley opened a new pull request #10807:
URL: https://github.com/apache/kafka/pull/10807


   [KAFKA-12797](https://issues.apache.org/jira/browse/KAFKA-12797) describes the difficulty in discovering bad clients that are responsible for evicting every other non-privileged session from the cache. While the ideal solution would be to have a quota (so a user ends up evicting only their own sessions once quota was reached), the fetch session cache is very performance sensitive, and this problem is very rare, so logging seems like a reasonable first step.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] mimaison commented on a change in pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
mimaison commented on a change in pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#discussion_r684229444



##########
File path: core/src/test/scala/unit/kafka/server/KafkaApisTest.scala
##########
@@ -2262,11 +2262,12 @@ class KafkaApisTest {
       Optional.empty())).asJava
     val fetchMetadata = new JFetchMetadata(0, 0)
     val fetchContext = new FullFetchContext(time, new FetchSessionCache(1000, 100),
-      fetchMetadata, fetchData, false, metadataCache.topicNamesToIds(), false)
+      fetchMetadata, fetchData, false, metadataCache.topicNamesToIds(), false, "my-client-id")

Review comment:
       Not that it matters but should we use something like `principal=my-principal`?
   Same below too.

##########
File path: core/src/main/scala/kafka/server/FetchSession.scala
##########
@@ -684,15 +686,15 @@ class FetchSessionCache(private val maxEntries: Int,
     * @param now        The current time in milliseconds.
     * @return           True if an entry was evicted; false otherwise.
     */
-  def tryEvict(privileged: Boolean, key: EvictableKey, now: Long): Boolean = synchronized {
+  def tryEvict(privileged: Boolean, key: EvictableKey, now: Long, evictor: String): Boolean = synchronized {
     // Try to evict an entry which is stale.
     val lastUsedEntry = lastUsed.firstEntry
     if (lastUsedEntry == null) {
       trace("There are no cache entries to evict.")
       false
     } else if (now - lastUsedEntry.getKey.lastUsedMs > evictionMs) {
       val session = lastUsedEntry.getValue
-      trace(s"Evicting stale FetchSession ${session.id}.")
+      trace(s"Evicting stale FetchSession ${session.id} for $evictor")

Review comment:
       Should we preserve the full stop here and below?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] tombentley commented on pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
tombentley commented on pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#issuecomment-852970504


   @dajac please could you review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] tombentley commented on a change in pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
tombentley commented on a change in pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#discussion_r683481734



##########
File path: core/src/main/scala/kafka/server/KafkaApis.scala
##########
@@ -692,6 +692,7 @@ class KafkaApis(val requestChannel: RequestChannel,
       fetchRequest.version,
       fetchRequest.metadata,
       fetchRequest.isFromFollower,
+      s"clientId=${request.context.clientId}, principal=${request.context.principal}",

Review comment:
       > It's not ideal that we have to build the string even if we don't use it.
   
   I couldn't see a nice way to do that. Basing it off the level of the `FetchSessionCache` logger seemed to break encapsulation, since at the call site we don't know how this string is going to be used.
   
   > 
   > In practice, this extra logging is useful if there's a malicious user forcing sessions to roll or if a user is using a broken client (like Sarama 1.26.0). So I wonder if we really need the clientId. While it's nice to have, it's a user-controlled field so this could be problematic for large values. WDYT?
   
   It's a good point that the client controls the `clientId`, and could potentially choose a long one. The value of logging it here isn't that high, so I can remove it. If you're making a point about a clientId being maliciously chosen then that's a bigger problem.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] tombentley commented on pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
tombentley commented on pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#issuecomment-853251332


   @weeco you're right it's not super helpful, because you would have to post-process the log to figure out who was causing the problem (i.e. observe that the same principal and client id were the evictor for many successive evictions). However, this patch adds no new logging, so it doesn't really make 2 much worse than it already may be, while at least allowing the problem user to _be_ identified. So I think it's a simple improvement over the current situation, where it's basically impossible to identify the culprit. This PR doesn't preclude finding a better solution later on.
   
   The issue with metrics like `IncrementalFetchSessionsCreatedPerSec<ClientId=([-.\w]+)>` is that that could be a lot of metrics, which most of the time are useless. I don't feel that adding and maintaining that many metrics can really be justified for an issue which is very rarely seen in practice. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] mimaison commented on a change in pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
mimaison commented on a change in pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#discussion_r683367337



##########
File path: core/src/main/scala/kafka/server/KafkaApis.scala
##########
@@ -692,6 +692,7 @@ class KafkaApis(val requestChannel: RequestChannel,
       fetchRequest.version,
       fetchRequest.metadata,
       fetchRequest.isFromFollower,
+      s"clientId=${request.context.clientId}, principal=${request.context.principal}",

Review comment:
       It's not ideal that we have to build the string even if we don't use it.
   
   In practice, this extra logging is useful if there's a malicious user forcing sessions to roll or if a user is using a broken client (like Sarama 1.26.0). So I wonder if we really need the clientId. While it's nice to have, it's a user-controlled field so this could be problematic for large values. WDYT?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] weeco commented on pull request #10807: KAFKA-12797: Log the evictor of fetch sessions

Posted by GitBox <gi...@apache.org>.
weeco commented on pull request #10807:
URL: https://github.com/apache/kafka/pull/10807#issuecomment-853223533


   Hey @tombentley , thanks for tackling this! I understand your performance concerns in regards to the quota implementation; I can't judge if a quota for that is actually too expensive. However here are my thoughts regarding your PR:
   
   Unless I missunderstand your proposed changes here, I believe that logging the fetch session slot evictor is currently not super helpful because:
   1. I believe fetch sessions are expected to be evicted. For me it seems like fetch sessions are not proactively relived from the session slots cache. In my clusters I can see the fetch session slots slowly growing over time (hours until all slots are taken) and then fetch sessions are regularly evicted at a rate of 1-3 evictions / second. I noticed this by tracking this via the JMX metrics `IncrementalFetchSessionEvictionsPerSec` and `NumIncrementalFetchSessions` (introduced in KIP-227)
   2. A potential attacker (or misbehaving client such as sarama v1.26.0) can quickly create new fetch sessions (thousands of fetch sessions per seconds) which would then cause one log line for each eviction. This again can cause high CPU usage (for printing the log line) or at least cause problems for many logging pipelines.
   
   **I could envision two other solutions to mitigate the issue around leaking fetch sessions:**
   
   1. Create one fetch session slot cache per client with a fixed, relatively low number of available fetch sessions. This way a single client can not negatively impact other clients. Possibly this is even a performance improvement because the lock contention as required in one large cache that is concurrently accessed from many fetch different fetch requests may be reduced?
   2. Introduce a new counter metric that logs the number of fetch sessions created per client id. Something like `IncrementalFetchSessionsCreatedPerSec<ClientId=([-.\w]+)>`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org