
Posted to dev@eagle.apache.org by "javacaoyu@163.com" <ja...@163.com> on 2016/12/02 03:38:43 UTC

Report a bug or unreasonable design

Hi Eagle devs,
    We use Eagle on a Cloudera CDH cluster, set up by following the tutorial on the official website.
    It ran fine for a long time.
    But today the Kafka cluster crashed, and the Kafka crash caused errors in the namenode.
    The standby namenode automatically transitioned to active status, which led to errors across the Hadoop cluster.

    We think sending the hdfs_audit log should be the job of a separate daemon. It should not require editing the namenode's log4j file and having the namenode load the Kafka jars at startup.

    Because the namenode and the 'send to Kafka' function run in a single JVM process, a Kafka outage can crash the namenode.

    We think Eagle should provide a separate daemon to send the HDFS audit log to Kafka. The design should decouple the two, not couple them more tightly.
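
    The separate daemon proposed above could look roughly like the sketch below. It is only an illustration of the idea, not anything Eagle ships: the names ship_new_lines and run_forever, the offset file, and the send callable (which would wrap a Kafka producer in a real deployment) are all hypothetical, and log rotation is handled only in the simplest way. The point it shows is that a failure in send() stays inside the shipper process; the namenode only ever writes its audit log to disk.

```python
import os
import sys
import time
from typing import Callable


def ship_new_lines(log_path: str, offset_path: str,
                   send: Callable[[str], None]) -> int:
    """Send complete lines appended to log_path since the saved offset.

    Returns the number of lines sent. If send() raises (for example because
    the broker is down), the exception propagates, but the saved offset
    stops before the failed line, so that line is retried on the next call.
    """
    # Load the byte offset reached by the previous call (0 on first run).
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as f:
            offset = int(f.read().strip() or 0)

    # Simplistic rotation handling: a shorter file means it was replaced.
    if offset > os.path.getsize(log_path):
        offset = 0

    sent = 0
    with open(log_path, "rb") as log:
        log.seek(offset)
        try:
            while True:
                line = log.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a line that is still being written
                send(line.decode("utf-8", errors="replace").rstrip("\n"))
                offset = log.tell()  # advance only after a successful send
                sent += 1
        finally:
            # Persist progress even if send() failed part-way through.
            with open(offset_path, "w") as f:
                f.write(str(offset))
    return sent


def run_forever(log_path: str, offset_path: str,
                send: Callable[[str], None],
                poll_seconds: float = 1.0) -> None:
    """Poll the log file and keep retrying while the sink is unavailable."""
    while True:
        try:
            ship_new_lines(log_path, offset_path, send)
        except Exception as exc:  # a sink outage only delays shipping
            print(f"send failed, will retry: {exc}", file=sys.stderr)
        time.sleep(poll_seconds)
```

    With this shape, a Kafka outage only means the offset stops advancing until the broker is back; delivery is at-least-once, since a crash between send() and saving the offset would resend a line.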



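    My English is not good; I hope this is understandable.

    I know the Eagle dev team has some Chinese members, so I am also including the details in Chinese [translated below]:
    We configured Eagle by following the official documentation: we edited the namenode's log4j configuration as described and put the Eagle-related jars on the namenode's classpath. After we restarted the namenode,
    the HDFS audit log was sent to Kafka successfully and everything ran stably for some time.
    But today the Kafka cluster went down, which caused problems in the namenode. Datanode connections to the namenode timed out and the standby namenode began taking over the cluster, but the original active namenode was still marked as active, which ultimately
    caused problems across the Hadoop cluster.
    After investigating, we found that when Kafka went down, the namenode also hit exceptions, and that is what caused the namenode problems.

    We suggest that sending to Kafka should not be bound into the namenode. The two should be decoupled, with a separate process that reads the audit log file and sends it to Kafka.
    That way, a Kafka outage will not affect the namenode.

    Thank you.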



javacaoyu@163.com

Re: Report a bug or unreasonable design

Posted by Hao Chen <ha...@apache.org>.

Thanks for reporting the problem. You are right, it's a very reasonable
concern.

Eagle is designed to decouple the alert engine from the data source through a
messaging bus such as Apache Kafka, so in fact you can use any log
shipping approach, such as log4j-kafka-appender (mentioned here), logstash,
syslogd, or filebeat. To avoid any potential impact on the namenode, you
could try a noninvasive solution such as logstash/filebeat/syslogd.

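[Chinese reply, translated]

Thank you very much for using Eagle and for reporting the problem to the
community. Your concern is entirely reasonable.

Eagle is designed to decouple the alert engine from the data source through
a messaging bus such as Kafka, so it does not require any particular log
collection agent. In a production environment, to avoid any worry about
impact on the namenode, you can try non-invasive log collection tools such
as logstash/filebeat/syslogd.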

- Hao
