Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2015/03/13 20:56:38 UTC
[jira] [Updated] (SAMZA-592) getSystemStreamMetadata loops forever when it receives bad metadata
[ https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Riccomini updated SAMZA-592:
----------------------------------
Attachment: SAMZA-592-0.patch
Attaching patch. RB at:
https://reviews.apache.org/r/32052/
Changes:
# Wrote a KafkaUtil.maybeThrowException method that suppresses ReplicaNotAvailableExceptions.
# Updated all call sites to use the method from (1).
# Removed partition-level errorCode checks in KafkaSystemAdmin.getTopicsAndPartitionsByBroker. The .getOffsets call checks errorCodes from its responses, and will trigger a full metadata refresh if there are errors.
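As a rough illustration of change (1), and not the code from the attached patch, the check could look like the sketch below. The class name, the exception type, and the method signature are assumptions made for a self-contained example; only the numeric error codes (0 for no error, 9 for replica not available) come from the Kafka wire protocol.

```java
// Hypothetical, self-contained stand-in; not the code from SAMZA-592-0.patch.
final class ErrorCodeSketch {
    // Numeric error codes as defined by the Kafka wire protocol.
    static final short NO_ERROR = 0;
    static final short REPLICA_NOT_AVAILABLE = 9;

    /** Stand-in exception type used only by this sketch. */
    static final class BrokerErrorException extends RuntimeException {
        final short code;

        BrokerErrorException(short code) {
            super("Broker returned error code " + code);
            this.code = code;
        }
    }

    /**
     * Throws for every error code except "no error" and
     * "replica not available", which is treated as safe to ignore.
     */
    static void maybeThrowException(short code) {
        if (code == NO_ERROR || code == REPLICA_NOT_AVAILABLE) {
            return;
        }
        throw new BrokerErrorException(code);
    }
}
```

Routing every response check through one helper like this keeps the decision about which codes are ignorable in a single place.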
Also ran zopkio tests, and [~navina]'s SAMZA-394 torture test (yet to be integrated into zopkio).
> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
> Key: SAMZA-592
> URL: https://issues.apache.org/jira/browse/SAMZA-592
> Project: Samza
> Issue Type: Bug
> Components: kafka
> Affects Versions: 0.9.0
> Reporter: Chris Riccomini
> Fix For: 0.9.0
>
> Attachments: SAMZA-592-0.patch
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite loop when it receives bad metadata from a broker. See [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349] comment.
> We experienced this bug last week. We were running a healthy cluster with topics that have a replication factor of 2. We brought down a *single* broker, and jobs would not start while the broker was down. The containers just repeated this error message:
> {noformat}
> 2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and bringing down the single broker resulted in proper leadership failover. Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes data when its metadata fetch results in an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an exception, if there is a ReplicaNotAvailableException during metadata refreshes.
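The first issue above amounts to re-fetching metadata on every retry instead of reusing the copy that produced the error. A minimal sketch of that idea follows, with hypothetical names throughout; the bounded attempt count is an assumption added so the loop cannot spin forever, and it is not a description of how KafkaSystemAdmin behaves.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch; names and the attempt limit are assumptions.
final class RefreshingRetrySketch {
    /**
     * Fetches fresh metadata at the start of every attempt, then uses it.
     * If the use step fails, the next attempt starts from a new refresh
     * rather than retrying with the same (possibly bad) metadata.
     */
    static <M, T> T fetchWithRefresh(Supplier<M> refreshMetadata,
                                     Function<M, T> useMetadata,
                                     int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                M metadata = refreshMetadata.get(); // refreshed on every pass
                return useMetadata.apply(metadata);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again with new metadata
            }
        }
        throw last;
    }
}
```

A real implementation would likely add a backoff between attempts; that is omitted here to keep the sketch short.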
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)