Posted to user@flume.apache.org by Srinivasan Subramanian <ss...@hotmail.com> on 2011/11/30 09:23:32 UTC

Log4J appender

I was evaluating the log4j appender provided with Flume.  But there is one aspect I don't understand:
The log4j appender makes a connection to the flume-agent and retries a maximum of 10 times (the default, which is configurable) if the connection is not made successfully.
Questions:
1. When will the connection fail?  If the agent is not running on the node?  In that case, given that the default implementation waits for 1 second before each retry for a total of 10 retries, would this mean that each logging call from the application could be delayed by up to 10 seconds?  That would affect performance, right?
2. What happens to the log message when the agent is not available?  Is it lost?
I am a little confused by the implementation, and any help in explaining this is appreciated.
Regards
Srini



Flume test setup with Log4J appender seeing lots of duplicate data

Posted by Srinivasan Subramanian <ss...@hotmail.com>.
Hi
I just followed the User Guide to set up a test Flume deployment.
I have a master I started with: flume master (command line)
Collector: flume node_nowatch -n collector (command line)
Node: flume node_nowatch (command line)
I can now check the flume master config at localhost:35871.  I can see the collector and agent nodes also connected.
The settings I have for source and sink are as follows:
Collector     default-flow    collectorSource(35853)    collectorSink("file:///var/log/collected/%Y-%m-%d/", "%{customer}-", 3600000)
int-logger    default-flow    avroSource(12345)         {split("~", 1, "customer") => agentSink("localhost", 35853)}
I have a Java test program with the Avro Log4J appender and ran a test application logging a simple message like "~somename~Test message <count>" in a loop of 20 iterations.  The Flume logger agent uses the avroSource, splits on the "~" delimiter, maps the token somename to the Flume attribute "customer", and sends the event onward.  The collector sink then writes to a directory based on the date, with the file name based on somename and a timestamp.  So far so good: all log messages are collected and processed properly.
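For reference, a minimal sketch of the kind of test driver described above. It assumes log4j.properties attaches the Flume Avro appender to this logger; that configuration (and the appender's class name) is omitted here, as it depends on the Flume build in use:

    import org.apache.log4j.Logger;

    // Minimal test driver in the shape described above: logs 20 messages
    // of the form "~customer~payload". Assumes log4j.properties wires the
    // Flume Avro appender to this logger (configuration not shown).
    public class FlumeAppenderTest {
        private static final Logger LOG = Logger.getLogger(FlumeAppenderTest.class);

        public static void main(String[] args) {
            for (int count = 0; count < 20; count++) {
                // The agent's split("~", 1, "customer") extracts "somename"
                // into the "customer" attribute used for output bucketing.
                LOG.info("~somename~Test message " + count);
            }
        }
    }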
However, I am seeing huge amounts of duplicated data.  I guess something is wrong with my setup/settings, but I am unable to fathom what's wrong.  Any help in getting this right is appreciated.
Thanks
Srini

Re: Log4J appender

Posted by Eric Sammer <es...@cloudera.com>.
On Wed, Nov 30, 2011 at 8:36 PM, Srinivasan Subramanian <ssrini_vasan@hotmail.com> wrote:

>  Hi Eric
>
> Thanks for that.  I will look at integrating the Log4J appender for Flume
> for sure.  A couple of additional questions:
>
> 1. From a performance standpoint, does the Log4J appender have any
> significant advantages over tailing the log file?
>

The log4j appender should be more reliable and safer to use than tail, as
it communicates directly via RPC with well-defined semantics. The tail
source has some issues with race conditions around quickly truncated files
and failure recovery. With respect to performance, they're probably close,
but it's hard to say. Tail requires disk IO, which can be slow, but the
log4j appender uses Avro RPC, which isn't blazing fast either.


> 2. It would be ideal if the Log4J appender also allowed putting in some
> metadata that I need to use for output bucketing.  Any ideas how that can
> be achieved?
>

I don't believe there's any way to inject metadata into the event
generated by the appender. Someone did some work to make the log4j
appender understand the MDC / NDC stuff (which I know very little about),
but I never had time to review and integrate the patch, sadly. You should
just take a look at the appender source; it's really simple.
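For illustration, application-side MDC usage in log4j 1.x looks like the sketch below. MDC.put and MDC.remove are standard log4j API; whether the values ever reach Flume as event attributes depends on appender support, which, per the above, was never integrated:

    import org.apache.log4j.Logger;
    import org.apache.log4j.MDC;

    // Sketch of log4j 1.x MDC usage. MDC.put/MDC.remove are standard log4j
    // API; whether these values reach Flume as event attributes depends on
    // appender support that, per the discussion above, was never integrated.
    public class MdcExample {
        private static final Logger LOG = Logger.getLogger(MdcExample.class);

        public static void main(String[] args) {
            MDC.put("customer", "somename");   // thread-local context value
            try {
                LOG.info("Test message with MDC context");
            } finally {
                MDC.remove("customer");        // always clean up the context
            }
        }
    }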


>
> Regards
> Srini
>



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

RE: Log4J appender

Posted by Srinivasan Subramanian <ss...@hotmail.com>.
Hi Eric
Thanks for that.  I will look at integrating the Log4J appender for Flume for sure.  A couple of additional questions:
1. From a performance standpoint, does the Log4J appender have any significant advantages over tailing the log file?
2. It would be ideal if the Log4J appender also allowed putting in some metadata that I need to use for output bucketing.  Any ideas how that can be achieved?
Regards
Srini





Re: Log4J appender

Posted by Eric Sammer <es...@cloudera.com>.
Srini:

On Wed, Nov 30, 2011 at 12:23 AM, Srinivasan Subramanian <ssrini_vasan@hotmail.com> wrote:

>  I was evaluating the log4j appender provided with Flume.  But there is
> one aspect I don't understand:
>
> The log4j appender makes a connection to the flume-agent and retries a
> maximum of 10 times (the default, which is configurable) if the connection
> is not made successfully.
>
> Questions:
>
> 1. When will the connection fail?  If the agent is not running on the
> node?  In that case, given that the default implementation waits for 1
> second before each retry for a total of 10 retries, would this mean that
> each logging call from the application could be delayed by up to 10
> seconds?  That would affect performance, right?
>

Almost certainly, yes, assuming log4j is synchronous (I'm 99.9% sure it
is). Of course, synchronous logging is the only way to guarantee event
delivery in this context; if the application were to log the event and
move on without waiting for a response, an event could get dropped and no
one would be responsible for retrying the send.
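To make the worst case concrete: with the defaults discussed here (10 retries, 1 second apart), one blocked log call can stall the calling thread for roughly 10 seconds. A minimal sketch of such a blocking retry loop, where connect() is a hypothetical stand-in for the appender's actual RPC setup rather than real Flume code:

    import java.io.IOException;

    // Illustrative sketch of a blocking retry loop like the one described
    // above: up to 10 attempts with a 1-second wait between them, so a
    // synchronous log call can stall its caller for ~10 seconds in the
    // worst case. connect() is a hypothetical stand-in for the appender's
    // real RPC connection setup, not actual Flume code.
    public class RetrySketch {
        static final int MAX_RETRIES = 10;
        static final long BACKOFF_MS = 1000L;

        static void connectWithRetries() throws IOException, InterruptedException {
            for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
                try {
                    connect();  // hypothetical RPC connection attempt
                    return;
                } catch (IOException e) {
                    if (attempt == MAX_RETRIES) {
                        throw e;  // out of retries: surface the failure
                    }
                    Thread.sleep(BACKOFF_MS);  // caller blocks during the wait
                }
            }
        }

        static void connect() throws IOException {
            throw new IOException("agent not reachable");  // placeholder
        }
    }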


>
> 2. What happens to the log message when the agent is not available?  Is it
> lost?
>

If the log4j appender runs out of retries, I believe I wrote it to throw
an exception. This would be the equivalent of using a standard file
appender and running out of disk space. In other words, the log call
failed and should be handled by the application.
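In practical terms, the application then owns the failure, just as it would if a file appender hit a full disk. A hedged sketch of handling it at the call site; the concrete exception type the appender throws isn't specified in this thread, so a broad catch is used:

    import org.apache.log4j.Logger;

    // Sketch of application-side handling when the appender exhausts its
    // retries. The exact exception type the Flume appender throws isn't
    // specified in this thread, so this catches RuntimeException broadly;
    // adjust to the concrete type in the appender source.
    public class LogFailureHandling {
        private static final Logger LOG = Logger.getLogger(LogFailureHandling.class);

        public static void main(String[] args) {
            try {
                LOG.info("event that must not be silently dropped");
            } catch (RuntimeException e) {
                // The log call itself failed: fall back, buffer, or alert.
                System.err.println("Flume delivery failed: " + e.getMessage());
            }
        }
    }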

Let me know if you have any other questions!


> I am a little confused by the implementation, and any help in explaining
> this is appreciated.
>
> Regards
> Srini
>
>
>


-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com