Posted to dev@kafka.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2014/09/12 04:30:33 UTC
[jira] [Updated] (KAFKA-1382) Update zkVersion on partition state update failures
[ https://issues.apache.org/jira/browse/KAFKA-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neha Narkhede updated KAFKA-1382:
---------------------------------
Fix Version/s: 0.8.1.2
> Update zkVersion on partition state update failures
> ---------------------------------------------------
>
> Key: KAFKA-1382
> URL: https://issues.apache.org/jira/browse/KAFKA-1382
> Project: Kafka
> Issue Type: Bug
> Reporter: Joel Koshy
> Assignee: Sriharsha Chintalapani
> Fix For: 0.8.2, 0.8.1.2
>
> Attachments: KAFKA-1382.patch, KAFKA-1382_2014-05-30_21:19:21.patch, KAFKA-1382_2014-05-31_15:50:25.patch, KAFKA-1382_2014-06-04_12:30:40.patch, KAFKA-1382_2014-06-07_09:00:56.patch, KAFKA-1382_2014-06-09_18:23:42.patch, KAFKA-1382_2014-06-11_09:37:22.patch, KAFKA-1382_2014-06-16_13:50:16.patch, KAFKA-1382_2014-06-16_14:19:27.patch
>
>
> Our updateIsr code is currently:
> private def updateIsr(newIsr: Set[Replica]) {
>   debug("Updated ISR for partition [%s,%d] to %s".format(topic, partitionId, newIsr.mkString(",")))
>   val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch, newIsr.map(r => r.brokerId).toList, zkVersion)
>   // use the epoch of the controller that made the leadership decision, instead of the current controller epoch
>   val (updateSucceeded, newVersion) = ZkUtils.conditionalUpdatePersistentPath(zkClient,
>     ZkUtils.getTopicPartitionLeaderAndIsrPath(topic, partitionId),
>     ZkUtils.leaderAndIsrZkData(newLeaderAndIsr, controllerEpoch), zkVersion)
>   if (updateSucceeded) {
>     inSyncReplicas = newIsr
>     zkVersion = newVersion
>     trace("ISR updated to [%s] and zkVersion updated to [%d]".format(newIsr.mkString(","), zkVersion))
>   } else {
>     info("Cached zkVersion [%d] not equal to that in zookeeper, skip updating ISR".format(zkVersion))
>   }
> }
> We encountered an interesting scenario recently when a large producer fully
> saturated the broker's NIC for over an hour. The large volume of data led to
> a number of ISR shrinks (and subsequent expands). The NIC saturation
> affected the zookeeper client heartbeats and led to a session timeout. The
> timeline was roughly as follows:
> - Attempt to expand ISR
> - Expansion written to zookeeper (confirmed in zookeeper transaction logs)
> - Session timeout after around 13 seconds (the configured timeout is 20
> seconds), so that lines up.
> - zkclient reconnects to zookeeper (with the same session ID) and retries
> the write - but uses the old cached zkVersion. The write fails because the
> node's zkVersion has already been advanced by the earlier successful write.
> - The ISR expand keeps failing after that and the only way to get out of it
> is to bounce the broker.
> In the above code, if the zkVersion is different we should probably update
> the cached version and even retry the expansion until it succeeds.
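A sketch of that suggestion (this is not the committed patch; the unbounded retry loop and the use of ZkUtils.readData to refresh the version are assumptions for illustration):

```scala
// Sketch only: on a conditional-update failure, refresh the cached
// zkVersion from zookeeper and retry, instead of failing forever with
// the stale version. Assumes the surrounding Partition class fields
// (zkClient, topic, partitionId, localBrokerId, leaderEpoch,
// controllerEpoch, inSyncReplicas, zkVersion) from the snippet above.
private def updateIsr(newIsr: Set[Replica]) {
  val path = ZkUtils.getTopicPartitionLeaderAndIsrPath(topic, partitionId)
  var done = false
  while (!done) {
    val newLeaderAndIsr = new LeaderAndIsr(localBrokerId, leaderEpoch,
      newIsr.map(r => r.brokerId).toList, zkVersion)
    val (updateSucceeded, newVersion) = ZkUtils.conditionalUpdatePersistentPath(
      zkClient, path,
      ZkUtils.leaderAndIsrZkData(newLeaderAndIsr, controllerEpoch), zkVersion)
    if (updateSucceeded) {
      inSyncReplicas = newIsr
      zkVersion = newVersion
      done = true
    } else {
      // Version conflict: re-read the node to pick up the current
      // zkVersion before retrying the conditional update.
      val (_, stat) = ZkUtils.readData(zkClient, path)
      zkVersion = stat.getVersion
      info("Refreshed cached zkVersion to [%d], retrying ISR update".format(zkVersion))
    }
  }
}
```

A blind retry is only safe if this broker is still the legitimate leader; a real fix would also need to guard against overwriting a newer leader's state (e.g. by re-checking leadership or the controller epoch) before retrying.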
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)