Posted to user@flume.apache.org by Hanish Bansal <ha...@gmail.com> on 2014/09/22 13:21:17 UTC

Flume log4j-appender

Hi All,

I want to use the Flume log4j appender for logging from a map-reduce
application that runs on different nodes. My use case is to have the logs
from all nodes at a centralized location (say HDFS) with time-based
synchronization.

As described in the links below, Flume has its own appender which can be used
to log an application's output to HDFS directly from log4j.

http://www.addsimplicity.com/adding_simplicity_an_engi/2010/10/sending-logs-down-the-flume.html

http://flume.apache.org/FlumeUserGuide.html#log4j-appender

Could anyone please tell me whether these logs are synchronized on a time
basis or not?

What I mean by time-based synchronization is: having logs from different
nodes in sorted order of time.

Also, could anyone provide a link that describes how the Flume log4j appender
works internally?


-- 
*Thanks & Regards*
*Hanish Bansal*

Re: Flume log4j-appender

Posted by Joey Echeverria <jo...@cloudera.com>.
Why do you need to remove the timestamp from the file names?

As far as creating the files on an hourly basis, it sounds like what
you want is a dataset partitioned by hour. This is very easy to do
using Kite and Flume's Kite integration. You'd create a partitioned
dataset similar to the ratings dataset shown in this example:

https://github.com/kite-sdk/kite/wiki/Hands-on-Kite-Lab-1:-Using-the-Kite-CLI-(Solution)

and then you'd configure Flume to write to the dataset similar to how
we do here:

https://github.com/kite-sdk/kite-examples/blob/snapshot/logging/flume.properties
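
For reference, a Kite dataset sink in a Flume configuration looks roughly
like the sketch below (the sink name, channel name, repository URI and
dataset name here are placeholders, not taken from the linked file; see the
flume.properties above for the real example):

# experimental Kite dataset sink: writes Flume events into a Kite dataset
agent1.sinks.kite-sink1.type = org.apache.flume.sink.kite.DatasetSink
agent1.sinks.kite-sink1.channel = mem-channel1
agent1.sinks.kite-sink1.kite.repo.uri = repo:hdfs://192.168.145.127:9000/data
agent1.sinks.kite-sink1.kite.dataset.name = events
agent1.sinks.kite-sink1.kite.batchSize = 100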

This will write your data into a directory structure that looks like:

/flume/events/year=2014/month=09/day=24/hour=14

You can then access the dataset using Kite's view API to get just the
last hour of data:

view:hdfs:/flume/events?timestamp=($AN_HOUR_AGO,)

That would get all of the data from an hour ago until now.

Feel free to email the Kite user list (cdk-dev@cloudera.org) or me offline
if you want more help with those aspects of it.

If you want to do the same thing with the regular Flume HDFS Sink, you need
to either have a timestamp header in your events or set
hdfs.useLocalTimeStamp to true to use a local timestamp. You then need to
define your path with a pattern:

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/year=%Y/month=%m/day=%d/hour=%H

See this link for more details:

http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
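
Putting the pieces together, a minimal hourly-partitioned HDFS sink could
look like the sketch below (it reuses the agent, sink and NameNode address
from the configuration quoted in this thread; the roll settings are
assumptions, not something you must use):

agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/year=%Y/month=%m/day=%d/hour=%H
# use the sink host's clock if events carry no timestamp header
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink1.hdfs.filePrefix = events
# roll once per hour; disable the count- and size-based triggers
agent1.sinks.hdfs-sink1.hdfs.rollInterval = 3600
agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
agent1.sinks.hdfs-sink1.hdfs.rollSize = 0
agent1.sinks.hdfs-sink1.hdfs.writeFormat = Text
agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream

With rollInterval at 3600 and the other roll triggers disabled, each hourly
directory ends up with roughly one file per sink.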

-Joey

-- 
Joey Echeverria

Re: Flume log4j-appender

Posted by Hanish Bansal <ha...@gmail.com>.
Thanks for the reply!

Reposting this since my earlier post was skipped.

I have configured Flume and my application to use the log4j appender. As
described in the log4j appender configuration, I have configured the agent
properly (Avro source, memory channel and HDFS sink).

I am able to collect logs in HDFS.

hdfs-sink configuration is:

agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
agent1.sinks.hdfs-sink1.hdfs.rollCount=0
agent1.sinks.hdfs-sink1.hdfs.rollSize=0
agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream

Files in HDFS are named based on the hdfs.filePrefix and hdfs.fileSuffix
configuration and also include the current timestamp in the name.

The HDFS files have names in the following format:
/flume/events/FlumeData.1411543838171
/flume/events/FlumeData.1411544272696
....
....

A custom file name can be specified using the hdfs.filePrefix and
hdfs.fileSuffix configuration properties.

*I want to know: can I somehow remove the timestamp from the file name?*

I am thinking of using the Flume log4j appender in the following way; please
let me know if there is any issue:

1. A distributed application (say map-reduce) runs on different nodes.
2. A Flume agent runs on only one node.
3. The application on all nodes is configured to use the same agent, so all
nodes publish logs to the same Avro source.
4. The Flume agent writes data to HDFS using the HDFS sink. Files should be
created on an hourly basis, with one file containing the logs of all nodes
for the last hour.

Thanks !!

Re: Flume log4j-appender

Posted by Hari Shreedharan <hs...@cloudera.com>.
You can’t write to the same path from more than one sink - either the directories have to be different or the hdfs.filePrefix has to be different. This is because HDFS allows only one writer per file. If multiple JVMs try to write to the same file, HDFS will throw an exception in the other JVMs indicating that the file's lease has expired.
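
If you do want every node's agent to land files under the same directory, a
common pattern is to make the file prefix unique per agent, for example by
stamping events with the local host name via the host interceptor. A rough
sketch (the agent, source, channel and sink names here are placeholders):

# on each node's agent: tag events with the local host name
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = host

# HDFS sink: same directory everywhere, but a host-specific file prefix
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
a1.sinks.k1.hdfs.filePrefix = events-%{host}

Events then land in one shared directory without two agents ever contending
for a lease on the same file.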



Thanks,
Hari


Re: Flume log4j-appender

Posted by Hanish Bansal <ha...@gmail.com>.
Another way to use it is:

Run a Flume agent on each node and configure the HDFS sink to use the same path.

This way, too, the logs of all nodes can be kept in a single file (log
aggregation of all nodes) in HDFS.

Please let me know if there is any issue with either use case.



-- 
*Thanks & Regards*
*Hanish Bansal*

Re: Flume log4j-appender

Posted by Hanish Bansal <ha...@gmail.com>.
Thanks Joey!

That is a great help for me.

I have configured Flume and my application to use the log4j appender. As
described in the log4j appender configuration, I have configured the agent
properly (Avro source, memory channel and HDFS sink).

I am able to collect logs in HDFS.

hdfs-sink configuration is:

agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://192.168.145.127:9000/flume/events/
agent1.sinks.hdfs-sink1.hdfs.rollInterval=30
agent1.sinks.hdfs-sink1.hdfs.rollCount=0
agent1.sinks.hdfs-sink1.hdfs.rollSize=0
agent1.sinks.hdfs-sink1.hdfs.writeFormat=Text
agent1.sinks.hdfs-sink1.hdfs.fileType=DataStream

Files in HDFS are named based on the hdfs.filePrefix and hdfs.fileSuffix
configuration and also include the current timestamp in the name.

The HDFS files have names in the following format:
/flume/events/FlumeData.1411543838171
/flume/events/FlumeData.1411544272696
....
....

A custom file name can be specified using the hdfs.filePrefix and
hdfs.fileSuffix configuration properties.

*I want to know: can I somehow remove the timestamp from the file name?*

I am thinking of using the Flume log4j appender in the following way; please
let me know if there is any issue:

   1. A distributed application (say map-reduce) runs on different nodes.
   2. A Flume agent runs on only one node.
   3. The application on all nodes is configured to use the same agent, so
   all nodes publish logs to the same Avro source.
   4. The Flume agent writes data to HDFS. Files should be created on an
   hourly basis, with one file containing the logs of all nodes for the last
   hour.

Thanks !!

-- 
*Thanks & Regards*
*Hanish Bansal*

Re: Flume log4j-appender

Posted by Joey Echeverria <jo...@cloudera.com>.
Inline below.

On Mon, Sep 22, 2014 at 7:25 PM, Hanish Bansal <
hanish.bansal.agarwal@gmail.com> wrote:

> Thanks for reply !!
>
> According to architecture multiple agents will run(on each node one agent)
> and all agents will send log data(as events) to common collector, and that
> collector will write data to hdfs.
>
> Here I have some questions:
> How collector will receive the data from agents and write to hdfs?
>

Agents are typically connected by having the AvroSink of one agent connect
to the AvroSource of the next agent in the chain.
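
As a rough illustration of that chaining (host names, ports and component
names below are placeholders, not taken from this thread):

# on each application node: Avro sink pointing at the collector
node.sinks.avro-sink1.type = avro
node.sinks.avro-sink1.hostname = collector.example.com
node.sinks.avro-sink1.port = 4545
node.sinks.avro-sink1.channel = mem-channel1

# on the collector: Avro source receiving from the node agents
collector.sources.avro-source1.type = avro
collector.sources.avro-source1.bind = 0.0.0.0
collector.sources.avro-source1.port = 4545
collector.sources.avro-source1.channels = mem-channel1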

> Does data coming from different agents is mixed up with each other since
> collector will also write the data to hdfs in order as it arrives to
> collector.
>
Events will enter the collector's channel in the order they arrive. Flume
doesn't do any re-ordering of data.

> Can we use this logging feature for real time logging? We can accept this
> if all logs  are in same order in hdfs as these were generated from
> application.
>

It depends on your requirements. If you need events to be sorted by
timestamp, then writing directly out of Flume to HDFS is not sufficient. If
you want your data in HDFS, then the best bet is to write to a partitioned
Dataset and run a batch job to sort partitions. For example, if you
partitioned by hour, you'd have an hourly job to sort the last hour's worth
of data. Alternatively, you could write the data to HBase which can sort
the data by timestamp during ingest. In either case, Kite would make it
easier for you to integrate with the underlying store as it can write to
both HDFS and HBase.

-Joey




-- 
Joey Echeverria

Re: Flume log4j-appender

Posted by Hanish Bansal <ha...@gmail.com>.
Thanks for the reply!

According to the architecture, multiple agents will run (one agent on each
node) and all agents will send log data (as events) to a common collector,
and that collector will write the data to HDFS.

Here I have some questions:
How will the collector receive the data from the agents and write it to HDFS?

Will data coming from different agents get mixed up with each other, since
the collector writes the data to HDFS in the order it arrives at the
collector?

Can we use this logging feature for real-time logging? We can accept this
if all logs are in the same order in HDFS as they were generated by the
application.

Regards
Hanish

Re: Flume log4j-appender

Posted by Joey Echeverria <jo...@cloudera.com>.
Hi Hanish,

The Log4jAppender is designed to connect to a Flume agent running an
AvroSource. So, you'd configure Flume similar to [1] and then point
the Log4jAppender to your agent using the log4j properties you linked
to.
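
For example, a minimal log4j.properties pointing the appender at an agent's
Avro source might look like this (the host name, port and logger name are
placeholders):

# Flume Log4jAppender shipping log events to an agent's Avro source
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent.example.com
log4j.appender.flume.Port = 41414
# with UnsafeMode, append errors are logged but not thrown back to the caller
log4j.appender.flume.UnsafeMode = true

log4j.logger.org.example.MyApp = INFO, flume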

The Log4jAppender will use Avro to inspect the object being logged to
determine its schema and to serialize it to bytes, which become the
body of the events sent to Flume. If you're logging Strings, which is
most common, then the Schema will just be a Schema.String. There are
two ways that schema information can be passed. You can configure the
Log4jAppender with a Schema URL that will be sent in the event headers,
or you can leave that out and a JSON representation of the Schema
will be sent as a header with each event. The URL is more efficient as
it avoids sending extra information with each record, but you can
leave it out to start your testing.
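
If you do go the URL route, the appender takes the schema location as a
property, for example (the path here is a placeholder):

log4j.appender.flume.AvroSchemaUrl = hdfs://namenode/schemas/event.avsc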

With regards to your second question, the answer is no. Flume does not
attempt to re-order events so your logs will appear in arrival order.
What I would do is write the data to a partitioned directory
structure and then have a Crunch job that sorts each partition as it
closes.

You might consider taking a look at the Kite SDK[2] as we have some
examples that show how to do the logging[3] and can also handle
getting the data properly partitioned on HDFS.

HTH,

-Joey

[1] http://flume.apache.org/FlumeUserGuide.html#avro-source
[2] http://kitesdk.org/docs/current/
[3] https://github.com/kite-sdk/kite-examples/tree/snapshot/logging

-- 
Joey Echeverria