Posted to commits@samza.apache.org by "Andy Chambers (JIRA)" <ji...@apache.org> on 2016/03/22 06:22:25 UTC

[jira] [Commented] (SAMZA-592) getSystemStreamMetadata loops forever when it receives bad metadata

    [ https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15205818#comment-15205818 ] 

Andy Chambers commented on SAMZA-592:
-------------------------------------

I'm seeing this symptom on 0.10.0.

This happens for me when starting a job for the first time on a cluster where none of the topics have been created yet. It does not happen in my single-node dev setup, but it does happen in our UAT environment. I'm a Samza newbie, so are there any tips you could share to help me replicate this in my own setup? For example, what conditions might make it more likely to happen?

> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
>                 Key: SAMZA-592
>                 URL: https://issues.apache.org/jira/browse/SAMZA-592
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>             Fix For: 0.9.0
>
>         Attachments: SAMZA-592-0.patch, SAMZA-592-1.patch, SAMZA-592-2.patch
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite loop when it receives bad metadata from a broker. See [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349] comment.
> We experienced this bug last week. We were running a healthy cluster with topics that have a replication factor of 2. We brought down a *single* broker, and jobs would not start while the broker was down. The containers just repeated this error message:
> {noformat}
>   2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and bringing down the single broker resulted in proper leadership failover. Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes data when its metadata fetch results in an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an exception, if there is a ReplicaNotAvailableException during metadata refreshes.
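
The two issues listed above can be illustrated with a minimal sketch. This is not Samza's KafkaSystemAdmin code and not the attached patch: the names ErrorCode, TOLERATED and fetch_metadata_with_retry are hypothetical, the fetch callable stands in for a real broker request, and the bounded attempt count is an assumption made so the example terminates. It only shows the shape of the behaviour: re-fetch metadata on every attempt, and treat a replica-not-available error as non-fatal.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class ErrorCode(Enum):
    # Hypothetical stand-ins for per-topic broker error codes.
    NONE = "none"
    REPLICA_NOT_AVAILABLE = "replica_not_available"
    LEADER_NOT_AVAILABLE = "leader_not_available"


# Issue 2: codes assumed safe to ignore when only reading metadata.
TOLERATED = {ErrorCode.NONE, ErrorCode.REPLICA_NOT_AVAILABLE}


def fetch_metadata_with_retry(
    fetch: Callable[[Iterable[str]], Dict[str, ErrorCode]],
    topics: Iterable[str],
    max_attempts: int = 5,
) -> Dict[str, ErrorCode]:
    """Return per-topic metadata once no topic reports a fatal error."""
    topics = list(topics)
    bad: Dict[str, ErrorCode] = {}
    for _ in range(max_attempts):
        # Issue 1: call fetch again on every attempt instead of
        # re-examining the stale result from the first request.
        result = fetch(topics)
        bad = {t: c for t, c in result.items() if c not in TOLERATED}
        if not bad:
            return result
        # A real implementation would back off here before retrying.
    raise RuntimeError(f"metadata still bad after {max_attempts} attempts: {bad}")
```

With a fetch function that fails once and then reports only REPLICA_NOT_AVAILABLE, the call returns on the second attempt; with one that keeps reporting a fatal code, it raises instead of looping forever.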



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)