Posted to dev@kafka.apache.org by "Joel Koshy (JIRA)" <ji...@apache.org> on 2012/08/08 01:49:10 UTC

[jira] [Updated] (KAFKA-385) RequestPurgatory enhancements - expire/checkSatisfy issue; add jmx beans

     [ https://issues.apache.org/jira/browse/KAFKA-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Koshy updated KAFKA-385:
-----------------------------

    Attachment: KAFKA-385-v1.patch
                graphite_explorer.jpg
                example_dashboard.jpg

Summary of changes and notes:

1 - Fixed the synchronization issue (raised in KAFKA-353) between
  checkSatisfied and expire by synchronizing on the DelayedItem.
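
  A minimal sketch of the idea (the class and method names here are
  illustrative, not the actual patch): the satisfaction check and the
  expiration both synchronize on the delayed request, so exactly one of the
  two paths can win.

    class DelayedOp {
      private var satisfied = false

      // Satisfaction path: run the caller's check under the lock and flip
      // the flag only if the check passes.
      def trySatisfy(check: => Boolean): Boolean = this.synchronized {
        if (!satisfied && check) { satisfied = true; true } else false
      }

      // Expiration path: wins only if the request was never satisfied.
      def tryExpire(): Boolean = this.synchronized {
        if (satisfied) false else { satisfied = true; true }
      }
    }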

2 - Added request purgatory metrics using the metrics-core library. Also
  added support for csv/ganglia/graphite reporters which I think is useful -
  e.g., I attached a graphite dashboard that was pretty easy to whip up. It
  should be a breeze to use metrics-core for other stats in Kafka.
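
  For illustration, roughly how a meter and a graphite reporter get wired up
  with the 2.x-era com.yammer.metrics API (the metric name and group below
  are made up for the example, not necessarily what the patch registers):

    import java.util.concurrent.TimeUnit
    import com.yammer.metrics.Metrics
    import com.yammer.metrics.reporting.GraphiteReporter

    object PurgatoryMetricsSketch {
      // Meter for expired delayed requests; mark() it from the expiration path.
      val expirationMeter =
        Metrics.newMeter(getClass, "ExpiredRequestsPerSec", "requests",
                         TimeUnit.SECONDS)

      // Push all registered metrics to graphite once a minute.
      def startGraphiteReporter(host: String, port: Int) {
        GraphiteReporter.enable(1, TimeUnit.MINUTES, host, port)
      }
    }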

3 - This brings in dependencies on metrics and slf4j, both with Apache
  compatible licenses. I don't know of any specific best-practices in using
  metrics-core as I have not used it before, so it would be great if people
  with experience using it could glance over this patch.

4 - It's a bit hard to tell right now which metrics are useful and which are
  pointless/redundant.  We can iron that out over time.

5 - Some metrics are global only, and some are both global and per-key (which
  I think is useful to have, e.g., to get a quick view of which partitions
  are slower).  E.g., it helped to see (in the attached screenshots) that
  fetch requests were all expiring - and it turned out to be a bug in how
  DelayedFetch requests from followers are checked for satisfaction.  The
  issue is that maybeUnblockDelayedFetch is only called if the required acks
  is 0/1; we need to call it always.  In the FetchRequestPurgatory
  checkSatisfied method, if it is a follower request we need to use the
  logEndOffset to determine the bytes available to the fetch request, and
  the HW if it is a non-follower request.  I fixed it to always check
  availableFetchBytes, but it can be made a little more efficient by having
  the DelayedFetch request keep track of the currently available bytes for
  each topic-partition key.
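
  A sketch of that satisfaction rule (PartitionView, availableBytes and
  fetchSatisfied are illustrative stand-ins for the real availableFetchBytes
  logic):

    case class PartitionView(logEndOffset: Long, highWatermark: Long)

    // availableBytes maps an upper-bound offset to the number of bytes the
    // fetch could read up to that offset.
    def fetchSatisfied(isFromFollower: Boolean,
                       minBytes: Int,
                       partition: PartitionView,
                       availableBytes: Long => Long): Boolean = {
      // Followers may read up to the log end offset; ordinary consumers are
      // bounded by the high watermark.
      val upperBound =
        if (isFromFollower) partition.logEndOffset else partition.highWatermark
      availableBytes(upperBound) >= minBytes
    }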

6 - I realized that both the watchersForKey and per-key metrics pools keep
  growing.  It may be useful to have a simple garbage collector in the Pool
  class that removes entries that have become stale (e.g., due to a leader
  change), but this is non-critical.
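
  A rough sketch of that non-critical idea - a pool that can drop entries a
  caller-supplied predicate considers stale (names are illustrative, not the
  existing Pool class):

    import java.util.concurrent.ConcurrentHashMap

    class GcPool[K, V <: AnyRef] {
      private val pool = new ConcurrentHashMap[K, V]()

      def putIfNotExists(key: K, value: V): V = {
        val prev = pool.putIfAbsent(key, value)
        if (prev != null) prev else value
      }

      def get(key: K): V = pool.get(key)

      // Drop stale entries (e.g., keys for partitions whose leader moved);
      // intended to be called from a periodic task.
      def collect(isStale: (K, V) => Boolean) {
        val it = pool.entrySet.iterator
        while (it.hasNext) {
          val entry = it.next()
          if (isStale(entry.getKey, entry.getValue))
            pool.remove(entry.getKey, entry.getValue)
        }
      }
    }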

7 - I needed to maintain the DelayedRequest metrics outside the purgatory
  because the purgatory itself is abstract and does not have internal
  knowledge of delayed requests and their keys.  Note that these metrics
  are for delayed requests only - i.e., they are not updated for requests
  that are satisfied immediately without going through the purgatory.

8 - There is one subtlety with producer throughput: I wanted to keep per-key
  throughput, so the metric is updated when an individual key is satisfied.
  This does not mean that the DelayedProduce itself will be satisfied -
  i.e., what the metric reports is an upper bound, since some DelayedProduce
  requests may have expired.
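
  To make the subtlety concrete, a hypothetical per-key meter (names made up,
  again assuming the 2.x metrics-core API):

    import java.util.concurrent.TimeUnit
    import com.yammer.metrics.Metrics

    class ProduceKeyThroughput(topic: String, partition: Int) {
      private val bytesMeter =
        Metrics.newMeter(getClass, "ProducerBytes-%s-%d".format(topic, partition),
                         "bytes", TimeUnit.SECONDS)

      // Marked when this key's acks come in.  The enclosing DelayedProduce
      // may still expire, so the rate aggregated over keys is only an upper
      // bound on what was actually acknowledged to producers.
      def keySatisfied(bytes: Long) {
        bytesMeter.mark(bytes)
      }
    }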

9 - I think it is better to wait for KAFKA-376 to go in first. In this
  patch, I hacked a simpler version of that patch - i.e., in
  availableFetchBytes, I check the logEndOffset instead of the
  high-watermark. Otherwise, follower fetch requests would see zero
  available bytes. Of course, this hack now breaks non-follower fetch
  requests.

10 - KafkaApis is getting pretty big - I can try to move DelayedMetrics out
  if that helps, although I prefer having it inside since all the
  DelayedRequests and purgatories are in there.

11 - There may be some temporary edits to start scripts/log4j that I will
  revert in the final patch.

What's left to do:

a - This was a rather painful rebase, so I need to review in case I missed
  something.

b - Optimization described above: DelayedFetch should keep track of
  bytesAvailable for each key, and FetchRequestPurgatory's checkSatisfied
  should take a (topic, partition) and compute availableBytes for just that
  key.
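
  Roughly what that could look like (names illustrative):

    import scala.collection.mutable

    class DelayedFetchState(minBytes: Int, keys: Seq[(String, Int)]) {
      // Bytes currently known to be available for each topic-partition key.
      private val bytesAvailable = mutable.Map(keys.map(k => k -> 0L): _*)

      // checkSatisfied would call this with just the key that changed,
      // instead of recomputing available bytes across every key.
      def update(key: (String, Int), available: Long): Boolean = synchronized {
        bytesAvailable(key) = available
        bytesAvailable.values.sum >= minBytes
      }
    }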

c - The JMX operations to start and stop the reporters are not working
  properly. I think I understand the issue, but will fix later.
                
> RequestPurgatory enhancements - expire/checkSatisfy issue; add jmx beans
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-385
>                 URL: https://issues.apache.org/jira/browse/KAFKA-385
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>            Assignee: Joel Koshy
>             Fix For: 0.8
>
>         Attachments: KAFKA-385-v1.patch, example_dashboard.jpg, graphite_explorer.jpg
>
>
> As discussed in KAFKA-353:
> 1 - There is potential for a client-side race condition in the implementations of expire and checkSatisfied. We can just synchronize on the DelayedItem.
> 2 - Would be good to add jmx beans to facilitate monitoring RequestPurgatory stats.
