Posted to commits@samza.apache.org by "Mark Mindenhall (JIRA)" <ji...@apache.org> on 2016/10/31 18:37:58 UTC

[jira] [Created] (SAMZA-1044) Checkpointing requires log.cleaner.enable=true

Mark Mindenhall created SAMZA-1044:
--------------------------------------

             Summary: Checkpointing requires log.cleaner.enable=true
                 Key: SAMZA-1044
                 URL: https://issues.apache.org/jira/browse/SAMZA-1044
             Project: Samza
          Issue Type: Bug
          Components: docs
         Environment: linux
            Reporter: Mark Mindenhall
            Priority: Minor


We're running Samza 0.9.1 with Kafka 0.8.2.1, which has a default setting of {{log.cleaner.enable=false}}.  We didn't think we needed to enable this, as we never created any topics with {{cleanup.policy=compact}}.  However, this morning we had a disk alert, and when I looked at the broker that triggered the alert, one of the Samza checkpoint topics was consuming 29GB within the {{/logs}} folder.

Long story short, I eventually figured out that all of the checkpoint topics were created with {{cleanup.policy=compact}}, and were growing unbounded because the log cleaner was disabled.  I set {{log.cleaner.enable=true}} on each broker, and restarted them.  Within minutes, the 29GB was reduced to 200-300KB.
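
For anyone who wants to check their own brokers, here is a rough Python sketch of the check I ended up doing by hand. It is not part of Samza or Kafka. The function name {{log_cleaner_enabled}} is made up for illustration, it only understands simple {{key=value}} lines, and it assumes the broker default is {{false}} as on our 0.8.2.1 brokers.

```python
def log_cleaner_enabled(properties_text, default=False):
    """Return True if log.cleaner.enable is set to true in the given
    server.properties text.

    Hypothetical helper: handles only plain key=value lines, and the
    default of False is an assumption based on Kafka 0.8.2.1.
    """
    enabled = default
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        # If the key appears more than once, the last occurrence wins.
        if sep and key.strip() == "log.cleaner.enable":
            enabled = value.strip().lower() == "true"
    return enabled


if __name__ == "__main__":
    sample = "broker.id=0\nlog.cleaner.enable=true\n"
    if log_cleaner_enabled(sample):
        print("log cleaner enabled")
    else:
        print("log cleaner disabled: compacted topics will grow unbounded")
```

Feeding it the contents of each broker's properties file would flag any broker where compacted topics, such as the Samza checkpoint topics, will never be cleaned.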

I thought I must have missed this when I created our jobs with checkpointing enabled, so I went and scoured the docs.  There's no mention of the {{log.cleaner.enable}} setting within the documentation (unless I missed it _again_).

I should add that we've been running most of these jobs for about a year, and I noticed that each time we deployed, it took longer and longer to transition from {{ACCEPTED}} to {{RUNNING}} in the YARN cluster.  Eventually, it was taking 10-15 minutes per job, and we didn't understand why.  After bouncing our staging cluster with {{log.cleaner.enable=true}} (and letting the log cleaner finish its work), I redeployed one of our jobs, and it once again took 15-20 seconds from {{ACCEPTED}} to {{RUNNING}}.

Please mention in the documentation that {{log.cleaner.enable}} must be set to {{true}} for checkpointing to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)