Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2015/03/13 20:56:38 UTC

[jira] [Updated] (SAMZA-592) getSystemStreamMetadata loops forever when it receives bad metadata

     [ https://issues.apache.org/jira/browse/SAMZA-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini updated SAMZA-592:
----------------------------------
    Attachment: SAMZA-592-0.patch

Attaching patch. RB at:

https://reviews.apache.org/r/32052/

Changes:

# Wrote a KafkaUtil.maybeThrowException method that suppresses ReplicaNotAvailableExceptions.
# Updated all existing error checks to use the method from (1).
# Removed partition-level errorCode checks in KafkaSystemAdmin.getTopicsAndPartitionsByBroker. The .getOffsets call checks errorCodes from its responses, and will trigger a full metadata refresh if there are errors.
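
As a rough illustration of change (1), the sketch below shows one way an error-code check can let a "replica not available" error through while still failing on other errors. It is not the code from the attached patch: the class name, the exception type, and the constants are hypothetical stand-ins written in plain Java with no Kafka dependency, and the numeric codes are assumed to follow Kafka's protocol numbering (0 for no error, 9 for replica not available).

```java
// Hypothetical, self-contained sketch; NOT the code from SAMZA-592-0.patch.
class ErrorCodeSketch {
    // Stand-in error codes (assumed to match Kafka's protocol numbering).
    static final short NO_ERROR = 0;
    static final short LEADER_NOT_AVAILABLE = 5;
    static final short REPLICA_NOT_AVAILABLE = 9;

    // Hypothetical exception type used in place of Kafka's own exceptions.
    static class BrokerErrorException extends RuntimeException {
        final short code;

        BrokerErrorException(short code) {
            super("Broker returned error code " + code);
            this.code = code;
        }
    }

    // Throws for any error code except "no error" and "replica not available".
    // A missing replica is tolerated here because the ticket reports that it
    // is safe to ignore when fetching metadata.
    static void maybeThrowException(short code) {
        if (code == NO_ERROR || code == REPLICA_NOT_AVAILABLE) {
            return;
        }
        throw new BrokerErrorException(code);
    }
}
```

With a helper of this shape, each response's error code can be passed through a single method, so the decision about which errors are tolerable lives in one place.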

Also ran zopkio tests, and [~navina]'s SAMZA-394 torture test (yet to be integrated into zopkio).

> getSystemStreamMetadata loops forever when it receives bad metadata
> -------------------------------------------------------------------
>
>                 Key: SAMZA-592
>                 URL: https://issues.apache.org/jira/browse/SAMZA-592
>             Project: Samza
>          Issue Type: Bug
>          Components: kafka
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>             Fix For: 0.9.0
>
>         Attachments: SAMZA-592-0.patch
>
>
> While investigating SAMZA-576, [~ewencp] discovered a bug in the KafkaSystemAdmin that causes getSystemStreamMetadata to go into an infinite loop when it receives bad metadata from a broker. See [this|https://issues.apache.org/jira/browse/SAMZA-576?focusedCommentId=14356349&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14356349] comment.
> We experienced this bug last week. We were running a healthy cluster with topics that have a replication factor of 2. We brought down a *single* broker, and jobs would not start while the broker was down. The containers just repeated this error message:
> {noformat}
>   2015-02-24 22:36:43 KafkaSystemAdmin [WARN] Unable to fetch last offsets for streams [some-topic] due to kafka.common.ReplicaNotAvailableException. Retrying.
> {noformat}
> Checking the cluster showed that all partitions were still available, and bringing down the single broker resulted in proper leadership failover. Samza, however, was not able to start.
> I was told by [~clarkhaskins] that it was actually safe to ignore the ReplicaNotAvailableException when fetching metadata. [~ewencp], can you confirm this?
> It seems that there are two issues:
> # KafkaSystemAdmin.getSystemStreamMetadata never refreshes data when its metadata fetch results in an error code.
> # We should allow the metadata fetch to proceed, rather than throwing an exception, if there is a ReplicaNotAvailableException during metadata refreshes.
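
The two issues listed in the quoted description can be sketched together as a bounded retry loop that refreshes metadata after an error and treats "replica not available" as success. This is only an illustration under stated assumptions: MetadataSource, fetchErrorCode, refresh, and fetchWithRefresh are hypothetical names, not Samza or Kafka APIs, and the numeric codes (0 for no error, 9 for replica not available) are assumed to follow Kafka's protocol numbering.

```java
// Hypothetical, self-contained sketch; not Samza's KafkaSystemAdmin.
class MetadataRetrySketch {
    static final short NO_ERROR = 0;              // assumed code value
    static final short REPLICA_NOT_AVAILABLE = 9; // assumed code value

    // Hypothetical abstraction over "fetch using cached metadata" and
    // "refresh the cached metadata from the brokers".
    interface MetadataSource {
        short fetchErrorCode();
        void refresh();
    }

    // Returns the number of attempts needed for a usable fetch.
    // On any other error the metadata is refreshed before retrying, so the
    // loop does not keep retrying against the same stale metadata.
    static int fetchWithRefresh(MetadataSource source, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            short code = source.fetchErrorCode();
            if (code == NO_ERROR || code == REPLICA_NOT_AVAILABLE) {
                return attempt;
            }
            source.refresh();
        }
        throw new IllegalStateException(
            "Metadata fetch still failing after " + maxAttempts + " attempts");
    }
}
```

The attempt limit is a choice made for this sketch so that the loop always terminates; the ticket itself does not say whether retries should be bounded.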


