Posted to user@flume.apache.org by Emile Kao <em...@gmx.net> on 2012/12/04 11:04:19 UTC

A customer use case

Hello guys,
now that I have successfully set up a running Flume / Hadoop system for my customer, I would like to ask for help implementing a requirement from the customer:

Here is what the use case looks like:

1. The customer has many Apache Web servers and WebSphere Application Servers that produce a lot of logs.

2. The customer wants to provide the logs to the developer team without giving them direct access to the machines hosting the logs.

3. The idea is to collect all the log files in one place and give the developer team access to them through a web interface.

4. My goal is to solve this problem using Flume / Hadoop.

Questions:

1. Which is the best way to implement such a scenario using Flume/ Hadoop?

2. The customer would like to keep the log files in their original state (file name, size, etc.). Is it practicable using Flume?

3. Is there a better way to collect the files without using the "Exec source" and the "tail -F" command?

Many Thanks and Cheers,
Emile

Re: A customer use case / using spoolDir

Posted by Patrick Wendell <pw...@gmail.com>.
To answer your other questions: The spooling source will pick up files
in the directory, send them with Flume, and rename them to indicate
that they have been transferred. Files that were already in the
directory before you started will be read and sent through Flume. It
treats these like any other files.
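
As a quick sketch of what that looks like on disk (I'm going from the 1.3.0 user guide here, so double-check the property name; the file name is just an example):

# The spooling directory source renames each fully transferred file in place,
# e.g. access.log becomes access.log.COMPLETED once it has been sent.
agent1.sources.spooldir-1.type = spooldir
agent1.sources.spooldir-1.spoolDir = /opt/apache2/logs/flumeSpool
# .COMPLETED is the default suffix; fileSuffix lets you change it
agent1.sources.spooldir-1.fileSuffix = .COMPLETED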

- Patrick

On Wed, Dec 5, 2012 at 4:34 AM, Alexander Alten-Lorenz
<wg...@gmail.com> wrote:
> Hi,
>
> as the error message says:
>> No Channels configured for spooldir-1
>
> add:
> agent1.sources.spooldir-1.channels = MemoryChannel-2
>
> When a file is dropped into the directory, the source should pick it up. Files that are already inside will be processed as well (if I'm not totally wrong)
>
> - Alex
>
>
> On Dec 5, 2012, at 1:00 PM, Emile Kao <em...@gmx.net> wrote:
>
>> Hello,
>> thank you for the hint to use the new spoolDir feature in the freshly released 1.3.0 version of Flume.
>>
>> Unfortunately, I am not getting the expected result.
>> Here is my configuration:
>>
>> agent1.channels = MemoryChannel-2
>> agent1.channels.MemoryChannel-2.type = memory
>>
>> agent1.sources = spooldir-1
>> agent1.sources.spooldir-1.type = spooldir
>> agent1.sources.spooldir-1.spoolDir = /opt/apache2/logs/flumeSpool
>> agent1.sources.spooldir-1.fileHeader = true
>>
>> agent1.sinks = HDFS
>> agent1.sinks.HDFS.channel = MemoryChannel-2
>> agent1.sinks.HDFS.type = hdfs
>> agent1.sinks.HDFS.hdfs.fileType = DataStream
>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000
>> agent1.sinks.HDFS.hdfs.writeFormat = Text
>>
>>
>> Upon start I am getting the following warning:
>> 2012-12-05 11:05:19,216 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(FlumeConfiguration.java:571)] Removed spooldir-1 due to No Channels configured for spooldir-1
>>
>> Questions:
>>
>> 1) Is something wrong in the above config?
>>
>> 2) How are the files gathered from the spool directory? Every time I drop (copy, etc.) a file into it?
>>
>> 3) What happens to the files that were already in the spool directory before I start the flume agent?
>>
>> I would appreciate any help!
>>
>> Cheers,
>> Emile
>>
>>
>> -------- Original Message --------
>>> Date: Tue, 4 Dec 2012 06:48:46 -0800
>>> From: Mike Percy <mp...@apache.org>
>>> To: user@flume.apache.org
>>> Subject: Re: A customer use case
>>
>>> Hi Emile,
>>>
>>> On Tue, Dec 4, 2012 at 2:04 AM, Emile Kao <em...@gmx.net> wrote:
>>>>
>>>> 1. Which is the best way to implement such a scenario using Flume/
>>> Hadoop?
>>>>
>>>
>>> You could use the file spooling client / source to stream these files back
>>> in the latest trunk and upcoming Flume 1.3.0 builds, along with hdfs sink.
>>>
>>> 2. The customer would like to keep the log files in their original state
>>>> (file name, size, etc.). Is it practicable using Flume?
>>>>
>>>
>>> Not recommended. Flume is an event streaming system, not a file copying
>>> mechanism. If you want to do that, just use some scripts with hadoop fs
>>> -put instead of Flume. Flume provides a bunch of stream-oriented features
>>> on top of its event streaming architecture, such as data enrichment
>>> capabilities, event routing, and configurable file rolling on HDFS, to
>>> name
>>> a few.
>>>
>>> Regards,
>>> Mike
>
> --
> Alexander Alten-Lorenz
> http://mapredit.blogspot.com
> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>

Re: A customer use case / using spoolDir

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Hi,

as the error message says:
> No Channels configured for spooldir-1

add: 
agent1.sources.spooldir-1.channels = MemoryChannel-2

When a file is dropped into the directory, the source should pick it up. Files that are already inside will be processed as well (if I'm not totally wrong)
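
So the source block would end up looking roughly like this (just a sketch of your own config with the channel binding added):

agent1.sources = spooldir-1
agent1.sources.spooldir-1.type = spooldir
agent1.sources.spooldir-1.spoolDir = /opt/apache2/logs/flumeSpool
agent1.sources.spooldir-1.fileHeader = true
# the missing line the validator complains about
agent1.sources.spooldir-1.channels = MemoryChannel-2

You may also want hdfs.path to point at a directory (e.g. hdfs://localhost:9000/flume/logs, just a placeholder) instead of the bare namenode URI, so the sink has a directory to write into.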

- Alex


On Dec 5, 2012, at 1:00 PM, Emile Kao <em...@gmx.net> wrote:

> Hello,
> thank you for the hint to use the new spoolDir feature in the freshly released 1.3.0 version of Flume.
> 
> Unfortunately, I am not getting the expected result.
> Here is my configuration:
> 
> agent1.channels = MemoryChannel-2
> agent1.channels.MemoryChannel-2.type = memory
> 
> agent1.sources = spooldir-1
> agent1.sources.spooldir-1.type = spooldir
> agent1.sources.spooldir-1.spoolDir = /opt/apache2/logs/flumeSpool
> agent1.sources.spooldir-1.fileHeader = true
> 
> agent1.sinks = HDFS
> agent1.sinks.HDFS.channel = MemoryChannel-2
> agent1.sinks.HDFS.type = hdfs
> agent1.sinks.HDFS.hdfs.fileType = DataStream
> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000
> agent1.sinks.HDFS.hdfs.writeFormat = Text
> 
> 
> Upon start I am getting the following warning:
> 2012-12-05 11:05:19,216 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(FlumeConfiguration.java:571)] Removed spooldir-1 due to No Channels configured for spooldir-1
> 
> Questions:
> 
> 1) Is something wrong in the above config?
> 
> 2) How are the files gathered from the spool directory? Every time I drop (copy, etc.) a file into it?
> 
> 3) What happens to the files that were already in the spool directory before I start the flume agent?
> 
> I would appreciate any help!
> 
> Cheers,
> Emile
> 
> 
> -------- Original Message --------
>> Date: Tue, 4 Dec 2012 06:48:46 -0800
>> From: Mike Percy <mp...@apache.org>
>> To: user@flume.apache.org
>> Subject: Re: A customer use case
> 
>> Hi Emile,
>> 
>> On Tue, Dec 4, 2012 at 2:04 AM, Emile Kao <em...@gmx.net> wrote:
>>> 
>>> 1. Which is the best way to implement such a scenario using Flume/
>> Hadoop?
>>> 
>> 
>> You could use the file spooling client / source to stream these files back
>> in the latest trunk and upcoming Flume 1.3.0 builds, along with hdfs sink.
>> 
>> 2. The customer would like to keep the log files in their original state
>>> (file name, size, etc.). Is it practicable using Flume?
>>> 
>> 
>> Not recommended. Flume is an event streaming system, not a file copying
>> mechanism. If you want to do that, just use some scripts with hadoop fs
>> -put instead of Flume. Flume provides a bunch of stream-oriented features
>> on top of its event streaming architecture, such as data enrichment
>> capabilities, event routing, and configurable file rolling on HDFS, to
>> name
>> a few.
>> 
>> Regards,
>> Mike

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF


A customer use case / using spoolDir

Posted by Emile Kao <em...@gmx.net>.
Hello,
thank you for the hint to use the new spoolDir feature in the freshly released 1.3.0 version of Flume.

Unfortunately, I am not getting the expected result.
Here is my configuration:

agent1.channels = MemoryChannel-2
agent1.channels.MemoryChannel-2.type = memory

agent1.sources = spooldir-1
agent1.sources.spooldir-1.type = spooldir
agent1.sources.spooldir-1.spoolDir = /opt/apache2/logs/flumeSpool
agent1.sources.spooldir-1.fileHeader = true

agent1.sinks = HDFS
agent1.sinks.HDFS.channel = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.fileType = DataStream
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000
agent1.sinks.HDFS.hdfs.writeFormat = Text


Upon start I am getting the following warning:
2012-12-05 11:05:19,216 (conf-file-poller-0) [WARN - org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(FlumeConfiguration.java:571)] Removed spooldir-1 due to No Channels configured for spooldir-1

Questions:

1) Is something wrong in the above config?

2) How are the files gathered from the spool directory? Every time I drop (copy, etc.) a file into it?

3) What happens to the files that were already in the spool directory before I start the flume agent?

I would appreciate any help!

Cheers,
Emile


-------- Original Message --------
> Date: Tue, 4 Dec 2012 06:48:46 -0800
> From: Mike Percy <mp...@apache.org>
> To: user@flume.apache.org
> Subject: Re: A customer use case

> Hi Emile,
> 
> On Tue, Dec 4, 2012 at 2:04 AM, Emile Kao <em...@gmx.net> wrote:
> >
> > 1. Which is the best way to implement such a scenario using Flume/
> Hadoop?
> >
> 
> You could use the file spooling client / source to stream these files back
> in the latest trunk and upcoming Flume 1.3.0 builds, along with hdfs sink.
> 
> 2. The customer would like to keep the log files in their original state
> > (file name, size, etc.). Is it practicable using Flume?
> >
> 
> Not recommended. Flume is an event streaming system, not a file copying
> mechanism. If you want to do that, just use some scripts with hadoop fs
> -put instead of Flume. Flume provides a bunch of stream-oriented features
> on top of its event streaming architecture, such as data enrichment
> capabilities, event routing, and configurable file rolling on HDFS, to
> name
> a few.
> 
> Regards,
> Mike

Re: A customer use case

Posted by Mike Percy <mp...@apache.org>.
Hi Emile,

On Tue, Dec 4, 2012 at 2:04 AM, Emile Kao <em...@gmx.net> wrote:
>
> 1. Which is the best way to implement such a scenario using Flume/ Hadoop?
>

You could use the file spooling client / source to stream these files back
in the latest trunk and upcoming Flume 1.3.0 builds, along with hdfs sink.

2. The customer would like to keep the log files in their original state
> (file name, size, etc.). Is it practicable using Flume?
>

Not recommended. Flume is an event streaming system, not a file copying
mechanism. If you want to do that, just use some scripts with hadoop fs
-put instead of Flume. Flume provides a bunch of stream-oriented features
on top of its event streaming architecture, such as data enrichment
capabilities, event routing, and configurable file rolling on HDFS, to name
a few.

Regards,
Mike

Re: A customer use case

Posted by Nitin Pawar <ni...@gmail.com>.
This is really doable with minimal effort on your end.

Use Flume with the HDFS sink. You can actually name the files as you like and
roll over on HDFS based on the number of events, size, or time.
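
For example, something like this on the sink side controls the file naming and the rollover (just a sketch; the prefix and thresholds are placeholders to adjust):

# prefix for the files the HDFS sink writes (placeholder name)
agent1.sinks.HDFS.hdfs.filePrefix = webserver-logs
# roll to a new file every 300 seconds ...
agent1.sinks.HDFS.hdfs.rollInterval = 300
# ... or once the current file reaches ~128 MB, whichever comes first ...
agent1.sinks.HDFS.hdfs.rollSize = 134217728
# ... and disable rolling by event count
agent1.sinks.HDFS.hdfs.rollCount = 0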

Developers can then access the logs through the HDFS NameNode URI, or a simple
Java DFS client inside a container can work as well, with more security
in place.

On the question of a better way of collecting logs: yes, you can achieve it
by using pipes, but in my view that would be a little complicated for very
minimal performance improvement. Others may suggest otherwise.


On Tue, Dec 4, 2012 at 3:34 PM, Emile Kao <em...@gmx.net> wrote:

> Hello guys,
> now that I have successfully set up a running Flume / Hadoop system for my
> customer, I would like to ask for help implementing a requirement from the
> customer:
>
> Here is what the use case looks like:
>
> 1. The customer has many Apache Web servers and WebSphere Application
> Servers that produce a lot of logs.
>
> 2. The customer wants to provide the logs to the developer team without
> giving them direct access to the machines hosting the logs.
>
> 3. The idea is to collect all the log files in one place and give the
> developer team access to them through a web interface.
>
> 4. My goal is to solve this problem using Flume / Hadoop.
>
> Questions:
>
> 1. Which is the best way to implement such a scenario using Flume/ Hadoop?
>
> 2. The customer would like to keep the log files in their original state
> (file name, size, etc.). Is it practicable using Flume?
>
> 3. Is there a better way to collect the files without using the "Exec source"
> and the "tail -F" command?
>
> Many Thanks and Cheers,
> Emile
>



-- 
Nitin Pawar