Posted to commits@samza.apache.org by "Edi Bice (JIRA)" <ji...@apache.org> on 2015/09/30 17:10:06 UTC

[jira] [Commented] (SAMZA-679) Optimize CoordinatorStream's bootstrap mechanism

    [ https://issues.apache.org/jira/browse/SAMZA-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936967#comment-14936967 ] 

Edi Bice commented on SAMZA-679:
--------------------------------

Aha, maybe this is what's been plaguing my job! I have a job which uses job.coordinator.system=kafka, and the corresponding folder (__samza_coordinator_my-topic) under kafka-logs has grown to about 1.3 GB. I made some code and configuration changes to the job, killed it, and have been trying to relaunch it. I was surprised to see OutOfMemory errors even with very large heap settings, and was wondering why it was consuming so much memory. Here are some of the __samza_coordinator_my-topic settings:
ReplicationFactor:3     Configs:segment.bytes=26214400,retention.ms=3600000,cleanup.policy=compact

> Optimize CoordinatorStream's bootstrap mechanism
> ------------------------------------------------
>
>                 Key: SAMZA-679
>                 URL: https://issues.apache.org/jira/browse/SAMZA-679
>             Project: Samza
>          Issue Type: Sub-task
>            Reporter: Naveen Somasundaram
>             Fix For: 0.10.0
>
>
> At present, when we bootstrap using the CoordinatorStreamConsumer, we read all the messages into a set. This is fine if log compaction is working, but consider that:
> 1. Log compaction can be turned off/broken for whatever reason
> 2. There is a time interval between compactions
> We should consider fixing the bootstrap method to hold only the latest checkpoint (overriding equals and hashCode of the set's elements is one way to go about it)
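The deduplication the ticket proposes can be sketched as below. This is a hypothetical illustration, not Samza's actual CoordinatorStreamConsumer: the CoordinatorMessage record and the key/value fields are stand-ins. The idea is that replaying the stream into a map keyed by message key means later messages overwrite earlier ones, so memory is bounded by the number of distinct keys rather than by the total (possibly uncompacted) log size.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BootstrapSketch {
    // Simplified stand-in for a coordinator stream message (illustrative only).
    record CoordinatorMessage(String key, String value) {}

    // Replay the stream in order; a later message with the same key
    // replaces the earlier one, keeping only the latest value per key.
    static Map<String, String> bootstrap(Iterable<CoordinatorMessage> stream) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (CoordinatorMessage msg : stream) {
            latest.put(msg.key(), msg.value());
        }
        return latest;
    }

    public static void main(String[] args) {
        List<CoordinatorMessage> messages = List.of(
            new CoordinatorMessage("task-0-offset", "100"),
            new CoordinatorMessage("task-1-offset", "50"),
            new CoordinatorMessage("task-0-offset", "200") // supersedes "100"
        );
        Map<String, String> state = bootstrap(messages);
        System.out.println(state.get("task-0-offset")); // latest value wins: 200
        System.out.println(state.size());               // 2 distinct keys
    }
}
```

This mirrors what log compaction does on the broker side, but applied at read time, so the bootstrap stays bounded even when compaction is disabled or has not yet run.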



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)