Posted to commits@samza.apache.org by "Ewen Cheslack-Postava (JIRA)" <ji...@apache.org> on 2015/03/11 23:37:38 UTC

[jira] [Commented] (SAMZA-592) getSystemStreamMetadata loops forever when it receives bad metadata

    [ https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357737#comment-14357737 ] 

Ewen Cheslack-Postava commented on SAMZA-592:
---------------------------------------------

It's true that a ReplicaNotAvailableException can be "ignored" in the sense that you expect it to be a transient issue in the metadata: you just need to wait for the issue to be resolved and the metadata in ZK to be updated, and then another attempt will get the right data. In this particular case, I don't think we need to do anything specific to ReplicaNotAvailableException -- if TopicMetadataCache.getTopicMetadata properly checked all possible locations for error codes, it would have triggered the refresh. You might still see this error in the log, since you might make a few requests before the issue is resolved, but we'd at least be making a new metadata request on each iteration.
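
For concreteness, here's a minimal sketch (not the actual Samza code) of what such a check might look like against the Kafka 0.8-era Scala API. metadataLooksGood is a hypothetical helper name; the only assumption is that a ReplicaNotAvailableException at the partition level is tolerable, while any other error code should force a metadata refresh:

{noformat}
import kafka.api.TopicMetadata
import kafka.common.ErrorMapping

// Hypothetical helper: decide whether a TopicMetadata response is usable.
// Check the error code at the topic level and at every partition level,
// but treat ReplicaNotAvailableException as ignorable, since the leader
// and offset information we actually need may still be present.
def metadataLooksGood(metadata: TopicMetadata): Boolean = {
  val topicOk = metadata.errorCode == ErrorMapping.NoError
  val partitionsOk = metadata.partitionsMetadata.forall { pm =>
    pm.errorCode == ErrorMapping.NoError ||
      pm.errorCode == ErrorMapping.ReplicaNotAvailableCode
  }
  topicOk && partitionsOk
}
{noformat}

If a response fails this kind of check, the caller would drop it from the cache and issue a fresh TopicMetadataRequest on the next attempt, instead of spinning on the stale result.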

> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
>                 Key: SAMZA-592
>                 URL: https://issues.apache.org/jira/browse/SAMZA-592
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>             Fix For: 0.9.0
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite loop when it receives bad metadata from a broker. See [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349] comment.
> We experienced this bug last week. We were running a healthy cluster with topics that have a replication factor of 2. We brought down a *single* broker, and jobs would not start while the broker was down. The containers just repeated this error message:
> {noformat}
>   2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and bringing down the single broker resulted in proper leadership failover. Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes data when its metadata fetch results in an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an exception, if there is a ReplicaNotAvailableException during metadata refreshes.


