You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/07 20:37:00 UTC

[jira] [Commented] (SAMZA-1572) Add fixed retries on failure in KafkaCheckpointManager

    [ https://issues.apache.org/jira/browse/SAMZA-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356032#comment-16356032 ] 

ASF GitHub Bot commented on SAMZA-1572:
---------------------------------------

GitHub user shanthoosh opened a pull request:

    https://github.com/apache/samza/pull/420

    SAMZA-1572: Add fixed retries on failure in KafkaCheckpointManager.

    KafkaCheckpointManager.writeCheckpoint currently goes into a infinite loop when an irrecoverable failure happens, this indefinitely blocks the commit phase (there by preventing processing). Added finite retries (50), which would retry for fixed time in case of failure before giving up. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shanthoosh/samza add_fixed_retries_in_kafka_checkpoint_manager

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/samza/pull/420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #420
    
----
commit 8b98814e9d96b17a5772d079c20832f6f094640e
Author: Shanthoosh Venkataraman <sv...@...>
Date:   2018-01-25T22:10:28Z

    SAMZA-1572: Add fixed retries on failure in KafkaCheckpointManager.

----


> Add fixed retries on failure in KafkaCheckpointManager
> ------------------------------------------------------
>
>                 Key: SAMZA-1572
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1572
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Shanthoosh Venkataraman
>            Assignee: Shanthoosh Venkataraman
>            Priority: Major
>
> KafkaCheckpointManager.writeCheckpoint currently goes into a infinite loop when an irrecoverable failure happens, this indefinitely blocks the commit phase (there by preventing processing). This exception is revealed only during the shutdown of the job making shutdown block indefinitely since the markers for shutdown are ignored by runloop which is blocked on commit phase.
> {code:java}
> 2018/01/22 19:18:10.503 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Flush failed. One or more batches of messages were not sent. Retrying. 2018/01/22 19:18:10.604 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:10.804 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:11.204 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:12.005 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:13.605 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:16.805 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:23.205 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:33.206 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:43.206 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:18:53.206 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:19:03.207 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:19:13.207 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exceptio 2018/01/22 19:19:23.207 WARN [KafkaCheckpointManager]  [] Failed to write checkpoint log partition entry org.apache.samza.checkpoint.kafka.KafkaCheckpointLogKey@8148c5bb: org.apache.samza.system.SystemProducerException: Producer was unable to recover from previous exception.. Retrying.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)