Posted to dev@atlas.apache.org by "Hemanth Yamijala (JIRA)" <ji...@apache.org> on 2016/06/02 06:44:59 UTC

[jira] [Commented] (ATLAS-801) Atlas hooks would lose messages if Kafka is down for extended period of time

    [ https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311825#comment-15311825 ] 

Hemanth Yamijala commented on ATLAS-801:
----------------------------------------

Starting some analysis notes.

First, I will look at what can be done to minimize the probability of this happening. This is low-hanging fruit for improving the current situation.

* We need to ensure we configure multiple replicas for ATLAS_HOOK in Kafka. This is already documented as operational guidance [here|http://atlas.incubator.apache.org/HighAvailability.html] under the *Notification Server* section. We could potentially automate this as part of Atlas server setup. This was the topic of ATLAS-515.
* We could add some retries to the Kafka producer config. Currently, we use the default value, which is no retries.
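
To make the retries point concrete, here is a rough sketch of the producer settings involved. The keys *retries* and *retry.backoff.ms* are standard Kafka producer properties; the override values (3 retries, 1 second backoff) and the helper names are only my assumptions for illustration, not tuned or agreed values.

```python
# Producer defaults as described in this comment: no retries,
# leader-only acks, 16 KB batches.
PRODUCER_DEFAULTS = {
    "retries": 0,
    "acks": "1",
    "batch.size": 16384,
}

# Hypothetical overrides; the values are illustrative only.
RETRY_OVERRIDES = {
    "retries": 3,
    "retry.backoff.ms": 1000,
}


def effective_producer_config(overrides):
    """Return a copy of the defaults with the overrides applied on top."""
    config = dict(PRODUCER_DEFAULTS)
    config.update(overrides)
    return config


def total_backoff_ms(config):
    """Time spent waiting between retries for one failed send.

    Counts backoff waits only; request timeouts would add to this.
    Falls back to Kafka's default backoff of 100 ms if none is set.
    """
    return config.get("retries", 0) * config.get("retry.backoff.ms", 100)


if __name__ == "__main__":
    config = effective_producer_config(RETRY_OVERRIDES)
    print(config)
    print("backoff window (ms):", total_backoff_ms(config))
```

Even with settings like these, the backoff window is only a few seconds, so retries would help with brief interruptions but not with Kafka being down for an extended period.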

I explored the other Kafka producer configuration options and feel we are OK there. Specifically:

* *acks* - we use the default value of 1, which means acknowledgement from the leader alone. This gives us the right balance between reliability and throughput.
* *batch.size* - we use the default value of 16 KB. Empirically, our message size seems to be about 8 KB, so we probably send about 2 messages per batch. Again, I don't think there is much to gain by changing this.
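
The batch estimate above is just the following arithmetic. This is a back-of-the-envelope sketch that ignores per-record overhead; the 8 KB figure is the empirical estimate from above and the function name is made up for illustration.

```python
BATCH_SIZE_BYTES = 16 * 1024      # Kafka producer default for batch.size
APPROX_MESSAGE_BYTES = 8 * 1024   # rough empirical hook message size


def messages_per_batch(batch_size_bytes, message_bytes):
    """Estimate how many messages fit in one producer batch.

    A message larger than the batch size is still sent, in a batch
    of its own, so the estimate never drops below 1.
    """
    return max(1, batch_size_bytes // message_bytes)


if __name__ == "__main__":
    print(messages_per_batch(BATCH_SIZE_BYTES, APPROX_MESSAGE_BYTES))
```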

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
>                 Key: ATLAS-801
>                 URL: https://issues.apache.org/jira/browse/ATLAS-801
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by the Atlas server. If communication to Kafka breaks, then this results in loss of metadata messages. This can be mitigated to some extent using multiple replicas for Kafka topics (see ATLAS-515). This JIRA is to see if we can make this even more robust and have some form of store and forward mechanism for increased fault tolerance.
