Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2014/07/10 05:25:05 UTC

[jira] [Commented] (SAMZA-310) Publish container logs to a SystemStream

    [ https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057088#comment-14057088 ] 

Chris Riccomini commented on SAMZA-310:
---------------------------------------

Could we just use [KafkaLog4jAppender.scala|https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/producer/KafkaLog4jAppender.scala] for this? We might need to extend it slightly to tag the log message with the container (or some other YARN/Samza-specific data).

One drawback of the Log4j (or SLF4J) approach is that it doesn't capture GC logs (or any other text file written to the logs directory in YARN), which are usually pretty useful. A counterargument is that GC info is already exposed (to some extent) through our JVM metrics class.

bq. Do we need to consider partitioning? Perhaps we can use the container name as partitioning key, so that the ordering of logs from each container is preserved.

Yeah, we should do this. Without a partition key, we'd lose ordering. Keying on the container name sends all of a container's log lines to the same partition, so their order is preserved.
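To illustrate why a container-name key preserves per-container ordering, here is a minimal sketch of key-based partition selection. The class name {{ContainerPartitioner}} and the hash-modulo scheme are assumptions made for illustration; this is a simplified stand-in, not the partitioner Kafka or Samza actually uses.

```java
/**
 * Hypothetical helper (not part of Samza or Kafka): maps a container name
 * to a partition so that every log line from one container lands in the
 * same partition and therefore keeps its relative order.
 */
final class ContainerPartitioner {
    private ContainerPartitioner() {}

    /**
     * Returns a partition in [0, numPartitions) for the given container name.
     * Math.floorMod keeps the result non-negative even when hashCode() is negative.
     */
    static int partitionFor(String containerName, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return Math.floorMod(containerName.hashCode(), numPartitions);
    }
}
```

The same name always maps to the same partition, so ordering holds within a container. Ordering across containers is still not guaranteed, and the mapping changes if the partition count changes.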

bq. The serde for encoding logs into a suitable wire format should be pluggable.

Yeah, I don't think the Kafka appender is pluggable. I agree that well-structured JSON blobs for the logs are far more useful. This is, in effect, what Logstash does, and it's not that uncommon to go full [ELK|http://www.elasticsearch.org/overview/elkdownloads/]. That said, Logstash always struck me as a Samza job waiting to happen. :) Perhaps it's better to emit well-formed logs from the beginning, as you suggest.
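As a rough sketch of what a pluggable log serde with a JSON default might look like: the names {{LogEvent}}, {{LogEventSerde}} and {{JsonLogEventSerde}}, and the choice of fields, are all hypothetical and not existing Samza or Kafka classes. The JSON is hand-built only to keep the example dependency-free; a real implementation would more likely use a JSON library.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical log event; the fields are illustrative, not a Samza type. */
final class LogEvent {
    final long timestampMs;
    final String level;
    final String containerName; // tags the message with its originating container
    final String message;

    LogEvent(long timestampMs, String level, String containerName, String message) {
        this.timestampMs = timestampMs;
        this.level = level;
        this.containerName = containerName;
        this.message = message;
    }
}

/** Hypothetical plug-in point: organisations could supply their own encoding. */
interface LogEventSerde {
    byte[] toBytes(LogEvent event);
}

/** Assumed default implementation that encodes each event as one JSON object. */
final class JsonLogEventSerde implements LogEventSerde {
    @Override
    public byte[] toBytes(LogEvent e) {
        String json = "{\"timestamp\":" + e.timestampMs
            + ",\"level\":\"" + escape(e.level) + "\""
            + ",\"container\":\"" + escape(e.containerName) + "\""
            + ",\"message\":\"" + escape(e.message) + "\"}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

An appender would then call {{serde.toBytes(event)}} and hand the bytes to the producer, so swapping the wire format means swapping only the serde.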

Another idea: perhaps we could expose this log information through a dashboard (SAMZA-300), in some way.

An alternative implementation idea that I've been toying with is to write a little log-tailing daemon. This lightweight daemon would take a config that includes a set of files/directories to tail, and a set of topics to send the lines to. It could be a stand-alone daemon or a contrib to Kafka; I'd always thought of it as stand-alone. For Samza, we could then install it on every node where there's an NM running, configure it to tail all userlog directories, and forward all files to the appropriate Kafka topics. This would have the advantage that we'd get GC logs (and other logs) in our streams. It'd also be generally useful as a Flume replacement: it could tail syslog, Apache logs, etc., and could potentially ship MapReduce logs to Kafka as well. I think YARN, in general, is struggling to find a solid story for how to deal with app logs.
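A very rough sketch of the core loop such a tailing daemon might run for one file. The class name {{FileTailer}} is hypothetical, and the {{BiConsumer}} sink is an assumed stand-in for a Kafka producer. Directory watching, log rotation beyond simple truncation, and offset checkpointing are deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;

/** Hypothetical tailer: forwards newly appended lines of one file to one topic. */
final class FileTailer {
    private final String path;
    private final String topic;
    private final BiConsumer<String, String> sink; // (topic, line); stands in for a producer
    private long offset = 0; // byte position just past the last complete line sent

    FileTailer(String path, String topic, BiConsumer<String, String> sink) {
        this.path = path;
        this.topic = topic;
        this.sink = sink;
    }

    /**
     * Reads any complete lines appended since the previous call and sends them
     * to the sink. A trailing partial line is left for the next poll.
     * Returns the number of lines forwarded.
     */
    int poll() throws IOException {
        int emitted = 0;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (file.length() < offset) {
                offset = 0; // file was truncated; start again from the top
            }
            file.seek(offset);
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            int b;
            // Byte-at-a-time reads are slow; a real daemon would buffer.
            while ((b = file.read()) != -1) {
                if (b == '\n') {
                    sink.accept(topic, new String(line.toByteArray(), StandardCharsets.UTF_8));
                    line.reset();
                    offset = file.getFilePointer();
                    emitted++;
                } else {
                    line.write(b);
                }
            }
        }
        return emitted;
    }
}
```

A daemon would create one tailer per configured file and call {{poll()}} on a timer. Because it reads files rather than hooking into the logging framework, it picks up GC logs and any other text files in the same way.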

> Publish container logs to a SystemStream
> ----------------------------------------
>
>                 Key: SAMZA-310
>                 URL: https://issues.apache.org/jira/browse/SAMZA-310
>             Project: Samza
>          Issue Type: New Feature
>          Components: container
>    Affects Versions: 0.7.0
>            Reporter: Martin Kleppmann
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming you're running on YARN, you have to navigate around the YARN web interface, and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs generated by Samza jobs to also be sent to a stream. There, they could be indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be pluggable. There can be a default implementation that uses JSON, analogous to MetricsSnapshotSerdeFactory for metrics, but organisations that already have a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log configuration for YARN jobs uses Log4j, which has the advantage that any frameworks/libraries that use Log4j but not Slf4j appear in the logs. However, Samza itself currently only depends on Slf4j. If we tie this feature to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name as partitioning key, so that the ordering of logs from each container is preserved.



--
This message was sent by Atlassian JIRA
(v6.2#6252)