Posted to user@flume.apache.org by Bhaskar <bm...@gmail.com> on 2012/06/13 19:15:35 UTC

Newbee question about flume 1.2 set up

Good Afternoon,
I am a newbie to Flume and have read through the limited documentation available.  I
would like to set up the following to test it out.

1.  Read apache access logs (as source)
2.  Use memory channel
3.  Write it to a NFS (or even local) file system

Can someone help me with the necessary configuration?  I am having a
difficult time gleaning that information from the available documentation.  I
am sure someone has done such a test before, and I would appreciate it if you
could pass on that information.  Secondly, I would also like to stream the logs
to a remote server.  Is that a log4j configuration, or do I need to run an agent
on each host to do so?  Any configuration examples would be of great help.

Thanks,
Bhaskar

Re: Newbee question about flume 1.2 set up

Posted by Hari Shreedharan <hs...@cloudera.com>.
Hi Bhaskar,

It seems like you did not specify which channel the sink should pick the
data up from. Please add the following line to your conf:

agent1.sinks.svc_0_sink.channel = MemoryChannel-2

(This tells the sink to pick the events up from MemoryChannel-2)
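
For reference, a minimal sketch of the full agent1 configuration with that line
added, reusing the names and paths from the config quoted below (values are
unchanged from the original post):

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = svc_0_sink

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -f /var/log/access.log
agent1.sources.tail.channels = MemoryChannel-2

agent1.channels.MemoryChannel-2.type = memory

agent1.sinks.svc_0_sink.type = FILE_ROLL
agent1.sinks.svc_0_sink.sink.directory = /flume_runtime/logs
agent1.sinks.svc_0_sink.rollInterval = 0
agent1.sinks.svc_0_sink.channel = MemoryChannel-2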

Thanks,
Hari


On Thu, Jun 14, 2012 at 9:52 AM, Bhaskar <bm...@gmail.com> wrote:

> For testing purposes, I tried the following configuration without much
> luck.  I see that the process started fine, but it just does not write
> anything to the sink.  I guess I am missing something here.  Can one of you
> gurus take a look and suggest what I am doing wrong?
>
> Thanks,
> Bhaskar
>
> agent1.sources = tail
> agent1.channels = MemoryChannel-2
> agent1.sinks = svc_0_sink
>
>
> agent1.sources.tail.type = exec
> agent1.sources.tail.command = tail -f /var/log/access.log
> agent1.sources.tail.channels = MemoryChannel-2
>
> agent1.sinks.svc_0_sink.type = FILE_ROLL
> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> agent1.sinks.svc_0_sink.rollInterval=0
>
> agent1.channels.MemoryChannel-2.type = memory
>
>
> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>wrote:
>
>> Hi Bhaskar,
>>
>> This is the flume.conf (http://pastebin.com/WULgUuaf) that I'm using.
>> I have an avro server on the hadoop-m host and one agent per node (slave
>> hosts). Each agent sends the output of an exec command to the avro server.
>>
>> Host1 : exec -> memory -> avro (sink)
>> Host2 : exec -> memory -> avro (sink)   >>>>>   MainHost : avro (source) -> memory -> rolling file (local FS)
>> ...
>> Host3 : exec -> memory -> avro (sink)
>>
>> Use your own exec command to read the Apache log.
>>
>> Guillaume Polaert | Cyrès Conseil
>>
>> From: Bhaskar [mailto:bmarthi@gmail.com]
>> Sent: Wednesday, June 13, 2012 19:16
>> To: flume-user@incubator.apache.org
>> Subject: Newbee question about flume 1.2 set up
>>
>> Good Afternoon,
>> I am a newbee to flume and read thru limited documentation available.  I
>> would like to set up the following to test out.
>>
>> 1.  Read apache access logs (as source)
>> 2.  Use memory channel
>> 3.  Write it to a NFS (or even local) file system
>>
>> Can some one help me with the necessary configuration.  I am having
>> difficult time to glean that information from available documentation.  I
>> am sure someone has done such test before and i appreciate if you can pass
>> on that information.  Secondly, I also would like to stream the logs to a
>> remote server.  Is that a log4j configuration or do i need to run an agent
>> on each host to do so?  Any configuration examples would be of great help.
>>
>> Thanks,
>> Bhaskar
>>
>
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Bhaskar,

        I am really sorry for being unresponsive; I was stuck with some
deliverables. There is absolutely no need for you to be sorry, though; I do
the same thing whenever I am stuck with something. :) You can ask as many
questions as you want and I'll try to respond in the best possible manner.
You are doing things normally; I don't see anything being force-fitted here.
As far as this issue is concerned, it is still unresolved. How is it going
now?

Regards,
    Mohammad Tariq


On Mon, Jun 18, 2012 at 7:31 PM, Bhaskar <bm...@gmail.com> wrote:
> Any thoughts/guidance please?
>
> Thanks,
> Bhaskar
>
>
> On Fri, Jun 15, 2012 at 11:01 AM, Bhaskar <bm...@gmail.com> wrote:
>>
>> Sorry to be a pest with a stream of questions.  I think I am going two steps
>> forward and four steps back. :-)  After my first successful attempt, I tried
>> running Flume with the following flow:
>>
>> 1.  HostA
>>   -- Source  is tail web server log
>>   -- channel jdbc
>>   -- sink is AVRO collection on Host B
>> Configuration:
>> agent3.sources = tail
>> agent3.sinks = avro-forward-sink
>> agent3.channels = jdbc-channel
>>
>> # Define source flow
>> agent3.sources.tail.type = exec
>> agent3.sources.tail.command = tail -f /common/log/access.log
>> agent3.sources.tail.channels = jdbc-channel
>>
>> # define the flow
>> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>>
>> # avro sink properties
>> agent3.sources.avro-forward-sink.type = avro
>> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
>> agent3.sources.avro-forward-sink.port = <<PORT>>
>>
>> # Define channels
>> agent3.channels.jdbc-channel.type = jdbc
>> agent3.channels.jdbc-channel.maximum.capacity = 10
>> agent3.channels.jdbc-channel.maximum.connections = 2
>>
>>
>> 2.  HostB
>>   -- Source is AVRO collection
>>   -- channel is memory
>>   -- sink is local file system
>>
>> Configuration:
>> # list sources, sinks and channels in the agent4
>> agent4.sources = avro-collection-source
>> agent4.sinks = svc_0_sink
>> agent4.channels = MemoryChannel-2
>>
>> # define the flow
>> agent4.sources.avro-collection-source.channels = MemoryChannel-2
>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>
>> # avro sink properties
>> agent4.sources.avro-collection-source.type = avro
>> agent4.sources.avro-collection-source.bind = <<IP Address>>
>> agent4.sources.avro-collection-source.port = <<PORT>>
>>
>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
>> agent4.sinks.svc_0_sink.rollInterval=600
>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>
>> agent4.channels.MemoryChannel-2.type = memory
>> agent4.channels.MemoryChannel-2.capacity = 100
>> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>>
>>
>> Basically I am trying to tail a file on one host and stream it to another
>> host running the sink.  During the trial run, the configuration is loaded fine
>> and I see the channels created fine.  I see an exception from the jdbc
>> channel first (Failed to persist event), and I am getting a Java heap space OOM
>> exception from Host B when Host A attempts to write.
>>
>> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
>> downstream.
>> java.lang.OutOfMemoryError: Java heap space
>>
>> This issue was already reported
>> (https://issues.apache.org/jira/browse/FLUME-1259) but I am not sure
>> if there is a workaround to this problem.  I have a couple of questions:
>>
>> 1.  Am I force-fitting a wrong solution here by using AVRO?
>> 2.  If so, what would be the right solution for streaming data from Host A
>> to Host B (or through intermediaries)?
>>
>> Thanks,
>> Bhaskar
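
A side note on the HostA configuration quoted above: the "avro sink properties"
are set on agent3.sources.avro-forward-sink rather than agent3.sinks.avro-forward-sink,
so the sink listed in agent3.sinks never actually gets a type, hostname, or port.
A sketch of how that block is normally written, keeping the same placeholder values:

agent3.sinks.avro-forward-sink.type = avro
agent3.sinks.avro-forward-sink.hostname = <<IP Address>>
agent3.sinks.avro-forward-sink.port = <<PORT>>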
>>
>> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>
>> wrote:
>>>
>>> Since you are thinking of using a multi-hop flow, I would suggest going
>>> for the "JDBC Channel", as there is a higher chance of error than in a
>>> single-hop flow, and in the JDBC Channel events are stored in persistent
>>> storage that's backed by a database. For detailed guidelines you can refer
>>> to the Flume 1.x User Guide at -
>>>
>>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>>
>>> Regards,
>>>     Mohammad Tariq
>>>
>>>
>>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
>>> > Hi Mohammad,
>>> > Thanks for the pointer there.  Do you think using a message queue (like
>>> > RabbitMQ) would be a good choice of communication channel between each
>>> > hop?  I am struggling to get a handle on how I need to configure my sink
>>> > in the intermediary hops of a multi-hop flow.  I'd appreciate any
>>> > guidance/examples.
>>> >
>>> > Thanks,
>>> > Bhaskar
>>> >
>>> >
>>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello Bhaskar,
>>> >>
>>> >>      That's great. The best approach to stream logs depends upon the
>>> >> type of source you want to watch, and looking at your use case, I would
>>> >> suggest going for "multi-hop" flows, where events travel through multiple
>>> >> agents before reaching the final destination.
>>> >>
>>> >> Regards,
>>> >>     Mohammad Tariq
>>> >>
>>> >>
>>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
>>> >> > I know what I am missing. :-)  I missed connecting the sink with the
>>> >> > channel.  My small POC works now and I am able to view the streamed
>>> >> > logs.  Thank you all for the guidance and patience in answering all my
>>> >> > questions.  So, what's the best approach to stream logs from other
>>> >> > hosts?  Basically, my next task would be to set up a collector (sort of)
>>> >> > model to stream logs to an intermediary and then stream from the
>>> >> > collector to a sink location.  I'd appreciate any thoughts/guidance in
>>> >> > this regard.
>>> >> >
>>> >> > Bhaskar
>>> >> >
>>> >> >
>>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>>> >> >>
>>> >> >> For testing purposes, I tried with the following configuration
>>> >> >> without
>>> >> >> much luck.  I see that the process started fine but it just does
>>> >> >> not
>>> >> >> write
>>> >> >> anything to the sink.  I guess i am missing something here.  Can
>>> >> >> one of
>>> >> >> you
>>> >> >> gurus take a look and suggest what i am doing wrong?
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Bhaskar
>>> >> >>
>>> >> >> agent1.sources = tail
>>> >> >> agent1.channels = MemoryChannel-2
>>> >> >> agent1.sinks = svc_0_sink
>>> >> >>
>>> >> >>
>>> >> >> agent1.sources.tail.type = exec
>>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>>> >> >>
>>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>>> >> >>
>>> >> >> agent1.channels.MemoryChannel-2.type = memory
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>>> >> >> <gp...@cyres.fr>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi Bhaskar,
>>> >> >>>
>>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>>> >> >>> using.
>>> >> >>> I have an avro server on the hadoop-m host and one agent per node
>>> >> >>> (slave
>>> >> >>> hosts). Each agent send the ouput of a exec command to avro
>>> >> >>> server.
>>> >> >>>
>>> >> >>> Host1 : exec -> memory -> avro (sink)
>>> >> >>> Host2 : exec -> memory -> avro (sink)   >>>>>   MainHost : avro (source) -> memory -> rolling file (local FS)
>>> >> >>> ...
>>> >> >>> Host3 : exec -> memory -> avro (sink)
>>> >> >>>
>>> >> >>>
>>> >> >>> Use your own exec command to read Apache log.
>>> >> >>>
>>> >> >>> Guillaume Polaert | Cyrès Conseil
>>> >> >>>
>>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
>>> >> >>> À : flume-user@incubator.apache.org
>>> >> >>> Objet : Newbee question about flume 1.2 set up
>>> >> >>>
>>> >> >>> Good Afternoon,
>>> >> >>> I am a newbee to flume and read thru limited documentation
>>> >> >>> available.
>>> >> >>>  I
>>> >> >>> would like to set up the following to test out.
>>> >> >>>
>>> >> >>> 1.  Read apache access logs (as source)
>>> >> >>> 2.  Use memory channel
>>> >> >>> 3.  Write it to a NFS (or even local) file system
>>> >> >>>
>>> >> >>> Can some one help me with the necessary configuration.  I am
>>> >> >>> having
>>> >> >>> difficult time to glean that information from available
>>> >> >>> documentation.
>>> >> >>>  I am
>>> >> >>> sure someone has done such test before and i appreciate if you can
>>> >> >>> pass on
>>> >> >>> that information.  Secondly, I also would like to stream the logs
>>> >> >>> to a
>>> >> >>> remote server.  Is that a log4j configuration or do i need to run
>>> >> >>> an
>>> >> >>> agent
>>> >> >>> on each host to do so?  Any configuration examples would be of
>>> >> >>> great
>>> >> >>> help.
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> Bhaskar
>>> >> >>
>>> >> >>
>>> >> >
>>> >
>>> >
>>
>>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Any thoughts/guidance please?

Thanks,
Bhaskar

On Fri, Jun 15, 2012 at 11:01 AM, Bhaskar <bm...@gmail.com> wrote:

> Sorry to be a pest with stream of questions.  I think i am going two steps
> forward and four steps back:-).  After my first successful attempt, i tried
> running flume with the following flow:
>
> 1.  HostA
>   -- Source  is tail web server log
>   -- channel jdbc
>   -- sink is AVRO collection on Host B
> Configuraiton:
> agent3.sources = tail
> agent3.sinks = avro-forward-sink
> agent3.channels = jdbc-channel
>
> # Define source flow
> agent3.sources.tail.type = exec
> agent3.sources.tail.command = tail -f /common/log/access.log
> agent3.sources.tail.channels = jdbc-channel
>
> # define the flow
> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>
> # avro sink properties
> agent3.sources.avro-forward-sink.type = avro
> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> agent3.sources.avro-forward-sink.port = <<PORT>>
>
> # Define channels
> agent3.channels.jdbc-channel.type = jdbc
> agent3.channels.jdbc-channel.maximum.capacity = 10
> agent3.channels.jdbc-channel.maximum.connections = 2
>
>
> 2.  HostB
>   -- Source is AVRO collection
>   -- channel is memory
>   -- sink is local file system
>
> Configuration:
> # list sources, sinks and channels in the agent4
> agent4.sources = avro-collection-source
> agent4.sinks = svc_0_sink
> agent4.channels = MemoryChannel-2
>
> # define the flow
> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> # avro sink properties
> agent4.sources.avro-collection-source.type = avro
> agent4.sources.avro-collection-source.bind = <<IP Address>>
> agent4.sources.avro-collection-source.port = <<PORT>>
>
> agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> agent4.sinks.svc_0_sink.rollInterval=600
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> agent4.channels.MemoryChannel-2.type = memory
> agent4.channels.MemoryChannel-2.capacity = 100
> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>
>
> Basically i am trying to tail a file on one host, stream it to another
> host running sink.  During the trial run, the configuration is loaded fine
> and i see the channels created fine.  I see an exception from the jdbc
> channel first (Failed to persist event).  I am getting a java heap space
> OOM exception from Host B when Host A attempts to write.
>
> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
> downstream.
> java.lang.OutOfMemoryError: Java heap space
>
> This issue was already reported
> https://issues.apache.org/jira/browse/FLUME-1259 but i am not sure if
> there is a work around to this problem.  I have couple questions:
>
> 1.  Am i force fitting a wrong solution here using AVRO?
> 2.  if so, what would be a right solution for streaming data from Host A
> to Host B (or thru intermediaries)?
>
> Thanks,
> Bhaskar
>
> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since you are thinking of using multi-hop flow I would suggest to go
>> for "JDBC Channel" as there is higher chance of error than single-hop
>> flow and in JDBC Channel events are stored in a persistent storage
>> that’s backed by a database. For detailed guidelines you can refer
>> Flume 1.x User Guide at -
>>
>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
>> > Hi Mohammad,
>> > Thanks for the pointer there.  Do you think using a message queue (like
>> > rabbitmq) would be a choice of communication channel between each hop?
>>  i am
>> > struggling to get a handle on how i need to configure my sink in
>> > intermediary hops in a multi-hop flow.    Appreciate any
>> guidance/examples.
>> >
>> > Thanks,
>> > Bhaskar
>> >
>> >
>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
>> wrote:
>> >>
>> >> Hello Bhaskar,
>> >>
>> >>      That's great..And the best approach to stream logs depends upon
>> >> the type of source you want to watch for..And by looking at your
>> >> usecase, I would suggest to go for "multi-hop" flows where events
>> >> travel through multiple agents before reaching the final destination.
>> >>
>> >> Regards,
>> >>     Mohammad Tariq
>> >>
>> >>
>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
>> >> > I know what i am missing:-)  I missed connecting the sink with the
>> >> > channel.
>> >> >  My small POC works now and i am able to view the streamed logs.
>>  Thank
>> >> > you
>> >> > all for the guidance and patience in answering all questions.  So,
>> whats
>> >> > the
>> >> > best approach to stream logs from other hosts?  Basically my next
>> task
>> >> > would
>> >> > be to set up collector (sort of) model to stream logs to intermediary
>> >> > and
>> >> > then stream from collector to a sink location.  I'd appreciate any
>> >> > thoughts/guidance in this regard.
>> >> >
>> >> > Bhaskar
>> >> >
>> >> >
>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>> >> >>
>> >> >> For testing purposes, I tried with the following configuration
>> without
>> >> >> much luck.  I see that the process started fine but it just does not
>> >> >> write
>> >> >> anything to the sink.  I guess i am missing something here.  Can
>> one of
>> >> >> you
>> >> >> gurus take a look and suggest what i am doing wrong?
>> >> >>
>> >> >> Thanks,
>> >> >> Bhaskar
>> >> >>
>> >> >> agent1.sources = tail
>> >> >> agent1.channels = MemoryChannel-2
>> >> >> agent1.sinks = svc_0_sink
>> >> >>
>> >> >>
>> >> >> agent1.sources.tail.type = exec
>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>> >> >>
>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >> >>
>> >> >> agent1.channels.MemoryChannel-2.type = memory
>> >> >>
>> >> >>
>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <
>> gpolaert@cyres.fr>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi Bhaskar,
>> >> >>>
>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>> using.
>> >> >>> I have an avro server on the hadoop-m host and one agent per node
>> >> >>> (slave
>> >> >>> hosts). Each agent send the ouput of a exec command to avro server.
>> >> >>>
>> >> >>> Host1 : exec -> memory -> avro (sink)
>> >> >>> Host2 : exec -> memory -> avro (sink)   >>>>>   MainHost : avro (source) -> memory -> rolling file (local FS)
>> >> >>> ...
>> >> >>> Host3 : exec -> memory -> avro (sink)
>> >> >>>
>> >> >>>
>> >> >>> Use your own exec command to read Apache log.
>> >> >>>
>> >> >>> Guillaume Polaert | Cyrès Conseil
>> >> >>>
>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
>> >> >>> À : flume-user@incubator.apache.org
>> >> >>> Objet : Newbee question about flume 1.2 set up
>> >> >>>
>> >> >>> Good Afternoon,
>> >> >>> I am a newbee to flume and read thru limited documentation
>> available.
>> >> >>>  I
>> >> >>> would like to set up the following to test out.
>> >> >>>
>> >> >>> 1.  Read apache access logs (as source)
>> >> >>> 2.  Use memory channel
>> >> >>> 3.  Write it to a NFS (or even local) file system
>> >> >>>
>> >> >>> Can some one help me with the necessary configuration.  I am having
>> >> >>> difficult time to glean that information from available
>> documentation.
>> >> >>>  I am
>> >> >>> sure someone has done such test before and i appreciate if you can
>> >> >>> pass on
>> >> >>> that information.  Secondly, I also would like to stream the logs
>> to a
>> >> >>> remote server.  Is that a log4j configuration or do i need to run
>> an
>> >> >>> agent
>> >> >>> on each host to do so?  Any configuration examples would be of
>> great
>> >> >>> help.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Bhaskar
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>>
>
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Nothing complex, just a simple agent that takes Apache web server logs
and dumps them into a directory that uses the "hostname" as the directory
name:

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = HDFS

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/apache2/access.log
agent1.sources.tail.channels = MemoryChannel-2
agent1.sources.tail.interceptors = hostint
agent1.sources.tail.interceptors.hostint.type =
org.apache.flume.interceptor.HostInterceptor$Builder
agent1.sources.tail.interceptors.hostint.preserveExisting = true
agent1.sources.tail.interceptors.hostint.useIP = true

agent1.sinks.HDFS.channel = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume/%{host}
agent1.sinks.HDFS.hdfs.fileType = DataStream
agent1.sinks.HDFS.hdfs.writeFormat = Text

agent1.channels.MemoryChannel-2.type = memory
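
If it helps, a typical way to launch an agent with a configuration like this
(assuming the file is saved as conf/flume.conf and the agent name is agent1;
adjust the paths to your install) is:

flume-ng agent --conf ./conf --conf-file ./conf/flume.conf --name agent1 -Dflume.root.logger=INFO,console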

Regards,
    Mohammad Tariq


On Wed, Jun 20, 2012 at 7:46 PM, Bhaskar <bm...@gmail.com> wrote:
> Glad to hear that Mohammad.  i am going to give it a try after building from
> trunk.  What is your final configuration if i may ask?
>
> Bhaskar
>
>
> On Wed, Jun 20, 2012 at 7:12 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> Great work guys..I am really thankful to you all.. It's working perfectly
>> fine.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Wed, Jun 20, 2012 at 9:10 AM, Juhani Connolly
>> <ju...@cyberagent.co.jp> wrote:
>> > That was just committed yesterday, so you will probably also need to
>> > pull
>> > the new trunk and rebuild.
>> >
>> > Setting it up is easy, just add to your source:
>> >
>> > agent.sources.mysource.interceptors = hostint
>> > agent.sources.mysource.interceptors.hostint.type =
>> > org.apache.flume.interceptor.HostInterceptor$Builder
>> > agent.sources.mysource.interceptors.hostint.preserveExisting = true
>> > agent.sources.mysource.interceptors.hostint.useIP = false
>> >
>> > This will add a header at the original host (without overwriting it after
>> > redirects), which can then be used in your path.
>> >
>> >
>> > On 06/20/2012 11:59 AM, Mike Percy wrote:
>> >>
>> >> Will just contributed an Interceptor to provide this out of the box:
>> >>
>> >> https://issues.apache.org/jira/browse/FLUME-1284
>> >>
>> >> Regards
>> >> Mike
>> >>
>> >>
>> >> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
>> >>
>> >>> There is no problem from your side..Have a look at this -
>> >>>
>> >>>
>> >>> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
>> >>>
>> >>> Regards,
>> >>> Mohammad Tariq
>> >>>
>> >>>
>> >>> On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar<bmarthi@gmail.com
>> >>> (mailto:bmarthi@gmail.com)>  wrote:
>> >>>>
>> >>>> Unfortunately, that part is not working as expected. Must be my
>> >>>> mistake
>> >>>> somewhere in the configuration. Here is my sink configuration.
>> >>>>
>> >>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>> >>>> agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
>> >>>> agent4.sinks.svc_0_sink.rollInterval=5400
>> >>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>> >>>>
>> >>>> Any thoughts on how to define host specific directory/file?
>> >>>>
>> >>>> Bhaskar
>> >>>>
>> >>>> On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq<dontariq@gmail.com
>> >>>> (mailto:dontariq@gmail.com)>  wrote:
>> >>>>>
>> >>>>>
>> >>>>> Hello Bhaskar,
>> >>>>>
>> >>>>> That's great...And we can use "%{host}" as the escape
>> >>>>> sequence to prefix our filenames(am I getting you correctly???).And
>> >>>>> I
>> >>>>> am waiting anxiously for your guide as I am still a newbie..:-)
>> >>>>>
>> >>>>> Regards,
>> >>>>> Mohammad Tariq
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar<bmarthi@gmail.com
>> >>>>> (mailto:bmarthi@gmail.com)>  wrote:
>> >>>>>>
>> >>>>>> Thank you guys for the responses. I actually was able to get around this
>> >>>>>> problem by tinkering around with my settings. I finally ended up with a
>> >>>>>> capacity of 10000 and commented out transactionCapacity (I originally set
>> >>>>>> it to 10), and it started working. Thanks for the insight. It took me a
>> >>>>>> bit of time to figure out the inner workings of AVRO to get it to send
>> >>>>>> data in the correct format. So, I got over that hump. :-) Here is my flow
>> >>>>>> for the POC.
>> >>>>>>
>> >>>>>> Host A agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc channel
>> >>>>>>   (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
>> >>>>>> Host B agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc channel
>> >>>>>>   (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
>> >>>>>> Host C agent -->  avro-collector source -->  file sink to local directory -- Memory channel
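
As a concrete illustration of the avro-client invocation referenced above, with
purely hypothetical host, port, and file values substituted for the placeholders:

flume-ng avro-client --conf ../conf/ -H collector.example.com -p 41414 -F /var/log/access.log

This reads the given file and sends its contents to the Avro source listening on
that host and port.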
>> >>>>>>
>> >>>>>> The issue I am running into is that I am unable to uniquely identify the
>> >>>>>> source of the log in the sink (meaning the log events from Host A and
>> >>>>>> Host B are combined into the same log on the disk and mixed up). Is there
>> >>>>>> a way to provide a unique identifier from the source so that we can track
>> >>>>>> the origin of the log? I am hoping to see in my sink log:
>> >>>>>>
>> >>>>>> Host A -- some log entry
>> >>>>>> Host B -- some log entry, etc.
>> >>>>>>
>> >>>>>> Is this feasible, or are there any alternative mechanisms to achieve this?
>> >>>>>> I am putting together a newbie guide that might help answer some of these
>> >>>>>> questions for others as I explore this architecture.
>> >>>>>>
>> >>>>>> As always thanks for your assistance,
>> >>>>>> Bhaskar
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
>> >>>>>> <juhani_connolly@cyberagent.co.jp
>> >>>>>> (mailto:juhani_connolly@cyberagent.co.jp)>  wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Hello Bhaskar,
>> >>>>>>>
>> >>>>>>> Using Avro is generally the recommended method to handle multi-hop
>> >>>>>>> flows, so no concerns there.
>> >>>>>>>
>> >>>>>>> Have you tried this setup using memory channels instead of jdbc? Last
>> >>>>>>> time I tested it, the JDBC channel had poor throughput, so you may be
>> >>>>>>> getting a logjam somewhere. How much data is getting entered into your
>> >>>>>>> logfile? Try raising the capacity on your jdbc channel by a lot (10000?).
>> >>>>>>> With a capacity of 10, if the reading side (Host B) isn't polling
>> >>>>>>> frequently enough, there are going to be problems. This is probably why
>> >>>>>>> you get the "failed to persist event". As far as FLUME-1259 is concerned,
>> >>>>>>> that should only be happening if bad data is being sent. You're not
>> >>>>>>> sending anything else to the same port, are you? Make sure that only the
>> >>>>>>> source and sink are set to that port and that nothing else is.
>> >>>>>>>
>> >>>>>>> If the problem continues, please post a chunk of the logs leading up to
>> >>>>>>> the OOM error (the full trace for the cause should be enough).
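
A sketch of the kind of change being suggested, reusing the channel names from
the configurations quoted earlier in the thread; the numbers are only illustrative:

agent3.channels.jdbc-channel.maximum.capacity = 10000
agent4.channels.MemoryChannel-2.capacity = 10000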
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On 06/16/2012 12:01 AM, Bhaskar wrote:
>> >>>>>>>
>> >>>>>>> Sorry to be a pest with stream of questions. I think i am going
>> >>>>>>> two
>> >>>>>>> steps
>> >>>>>>> forward and four steps back:-). After my first successful attempt,
>> >>>>>>> i
>> >>>>>>> tried
>> >>>>>>> running flume with the following flow:
>> >>>>>>>
>> >>>>>>> 1. HostA
>> >>>>>>> -- Source is tail web server log
>> >>>>>>> -- channel jdbc
>> >>>>>>> -- sink is AVRO collection on Host B
>> >>>>>>> Configuraiton:
>> >>>>>>> agent3.sources = tail
>> >>>>>>> agent3.sinks = avro-forward-sink
>> >>>>>>> agent3.channels = jdbc-channel
>> >>>>>>>
>> >>>>>>> # Define source flow
>> >>>>>>> agent3.sources.tail.type = exec
>> >>>>>>> agent3.sources.tail.command = tail -f /common/log/access.log
>> >>>>>>> agent3.sources.tail.channels = jdbc-channel
>> >>>>>>>
>> >>>>>>> # define the flow
>> >>>>>>> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>> >>>>>>>
>> >>>>>>> # avro sink properties
>> >>>>>>> agent3.sources.avro-forward-sink.type = avro
>> >>>>>>> agent3.sources.avro-forward-sink.hostname =<<IP Address>>
>> >>>>>>> agent3.sources.avro-forward-sink.port =<<PORT>>
>> >>>>>>>
>> >>>>>>> # Define channels
>> >>>>>>> agent3.channels.jdbc-channel.type = jdbc
>> >>>>>>> agent3.channels.jdbc-channel.maximum.capacity = 10
>> >>>>>>> agent3.channels.jdbc-channel.maximum.connections = 2
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> 2. HostB
>> >>>>>>> -- Source is AVRO collection
>> >>>>>>> -- channel is memory
>> >>>>>>> -- sink is local file system
>> >>>>>>>
>> >>>>>>> Configuration:
>> >>>>>>> # list sources, sinks and channels in the agent4
>> >>>>>>> agent4.sources = avro-collection-source
>> >>>>>>> agent4.sinks = svc_0_sink
>> >>>>>>> agent4.channels = MemoryChannel-2
>> >>>>>>>
>> >>>>>>> # define the flow
>> >>>>>>> agent4.sources.avro-collection-source.channels = MemoryChannel-2
>> >>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>> >>>>>>>
>> >>>>>>> # avro sink properties
>> >>>>>>> agent4.sources.avro-collection-source.type = avro
>> >>>>>>> agent4.sources.avro-collection-source.bind =<<IP Address>>
>> >>>>>>> agent4.sources.avro-collection-source.port =<<PORT>>
>> >>>>>>>
>> >>>>>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>> >>>>>>> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
>> >>>>>>> agent4.sinks.svc_0_sink.rollInterval=600
>> >>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>> >>>>>>>
>> >>>>>>> agent4.channels.MemoryChannel-2.type = memory
>> >>>>>>> agent4.channels.MemoryChannel-2.capacity = 100
>> >>>>>>> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Basically i am trying to tail a file on one host, stream it to
>> >>>>>>> another
>> >>>>>>> host running sink. During the trial run, the configuration is
>> >>>>>>> loaded
>> >>>>>>> fine
>> >>>>>>> and i see the channels created fine. I see an exception from the
>> >>>>>>> jdbc
>> >>>>>>> channel first (Failed to persist event). I am getting a java heap
>> >>>>>>> space OOM
>> >>>>>>> exception from Host B when Host A attempts to write.
>> >>>>>>>
>> >>>>>>> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception
>> >>>>>>> from
>> >>>>>>> downstream.
>> >>>>>>> java.lang.OutOfMemoryError: Java heap space
>> >>>>>>>
>> >>>>>>> This issue was already
>> >>>>>>> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am
>> >>>>>>> not
>> >>>>>>> sure
>> >>>>>>> if there is a work around to this problem. I have couple
>> >>>>>>> questions:
>> >>>>>>>
>> >>>>>>> 1. Am i force fitting a wrong solution here using AVRO?
>> >>>>>>> 2. if so, what would be a right solution for streaming data from
>> >>>>>>> Host
>> >>>>>>> A
>> >>>>>>> to Host B (or thru intermediaries)?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Bhaskar
>> >>>>>>>
>> >>>>>>> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq<dontariq@gmail.com
>> >>>>>>> (mailto:dontariq@gmail.com)>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Since you are thinking of using multi-hop flow I would suggest to
>> >>>>>>>> go
>> >>>>>>>> for "JDBC Channel" as there is higher chance of error than
>> >>>>>>>> single-hop
>> >>>>>>>> flow and in JDBC Channel events are stored in a persistent
>> >>>>>>>> storage
>> >>>>>>>> that’s backed by a database. For detailed guidelines you can
>> >>>>>>>> refer
>> >>>>>>>> Flume 1.x User Guide at -
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> Mohammad Tariq
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar<bmarthi@gmail.com
>> >>>>>>>> (mailto:bmarthi@gmail.com)>  wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi Mohammad,
>> >>>>>>>>> Thanks for the pointer there. Do you think using a message queue
>> >>>>>>>>> (like
>> >>>>>>>>> rabbitmq) would be a choice of communication channel between
>> >>>>>>>>> each
>> >>>>>>>>> hop?
>> >>>>>>>>> i am
>> >>>>>>>>> struggling to get a handle on how i need to configure my sink in
>> >>>>>>>>> intermediary hops in a multi-hop flow. Appreciate any
>> >>>>>>>>> guidance/examples.
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>> Bhaskar
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Thu, Jun 14, 2012 at 1:57 PM, Mohammad
>> >>>>>>>>> Tariq<dontariq@gmail.com
>> >>>>>>>>> (mailto:dontariq@gmail.com)>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hello Bhaskar,
>> >>>>>>>>>>
>> >>>>>>>>>> That's great..And the best approach to stream logs depends
>> >>>>>>>>>> upon
>> >>>>>>>>>> the type of source you want to watch for..And by looking at
>> >>>>>>>>>> your
>> >>>>>>>>>> usecase, I would suggest to go for "multi-hop" flows where
>> >>>>>>>>>> events
>> >>>>>>>>>> travel through multiple agents before reaching the final
>> >>>>>>>>>> destination.
>> >>>>>>>>>>
>> >>>>>>>>>> Regards,
>> >>>>>>>>>> Mohammad Tariq
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar<bmarthi@gmail.com
>> >>>>>>>>>> (mailto:bmarthi@gmail.com)>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> I know what i am missing:-) I missed connecting the sink with
>> >>>>>>>>>>> the
>> >>>>>>>>>>> channel.
>> >>>>>>>>>>> My small POC works now and i am able to view the streamed
>> >>>>>>>>>>> logs.
>> >>>>>>>>>>> Thank
>> >>>>>>>>>>> you
>> >>>>>>>>>>> all for the guidance and patience in answering all questions.
>> >>>>>>>>>>> So,
>> >>>>>>>>>>> whats
>> >>>>>>>>>>> the
>> >>>>>>>>>>> best approach to stream logs from other hosts? Basically my
>> >>>>>>>>>>> next
>> >>>>>>>>>>> task
>> >>>>>>>>>>> would
>> >>>>>>>>>>> be to set up collector (sort of) model to stream logs to
>> >>>>>>>>>>> intermediary
>> >>>>>>>>>>> and
>> >>>>>>>>>>> then stream from collector to a sink location. I'd appreciate
>> >>>>>>>>>>> any
>> >>>>>>>>>>> thoughts/guidance in this regard.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Bhaskar
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar<bmarthi@gmail.com
>> >>>>>>>>>>> (mailto:bmarthi@gmail.com)>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> For testing purposes, I tried with the following
>> >>>>>>>>>>>> configuration
>> >>>>>>>>>>>> without
>> >>>>>>>>>>>> much luck. I see that the process started fine but it just
>> >>>>>>>>>>>> does
>> >>>>>>>>>>>> not
>> >>>>>>>>>>>> write
>> >>>>>>>>>>>> anything to the sink. I guess i am missing something here.
>> >>>>>>>>>>>> Can
>> >>>>>>>>>>>> one of
>> >>>>>>>>>>>> you
>> >>>>>>>>>>>> gurus take a look and suggest what i am doing wrong?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>> Bhaskar
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> agent1.sources = tail
>> >>>>>>>>>>>> agent1.channels = MemoryChannel-2
>> >>>>>>>>>>>> agent1.sinks = svc_0_sink
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> agent1.sources.tail.type = exec
>> >>>>>>>>>>>> agent1.sources.tail.command = tail -f /var/log/access.log
>> >>>>>>>>>>>> agent1.sources.tail.channels = MemoryChannel-2
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >>>>>>>>>>>> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >>>>>>>>>>>> agent1.sinks.svc_0_sink.rollInterval=0
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> agent1.channels.MemoryChannel-2.type = memory
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>> >>>>>>>>>>>> <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi Bhaskar,
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> This is the flume.conf (http://pastebin.com/WULgUuaf) what
>> >>>>>>>>>>>>> I'm
>> >>>>>>>>>>>>> using.
>> >>>>>>>>>>>>> I have an avro server on the hadoop-m host and one agent per
>> >>>>>>>>>>>>> node
>> >>>>>>>>>>>>> (slave
>> >>>>>>>>>>>>> hosts). Each agent send the ouput of a exec command to avro
>> >>>>>>>>>>>>> server.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Host1 : exec ->  memory ->  avro (sink)
>> >>>>>>>>>>>>> Host2 : exec ->  memory ->  avro (sink)   >>>>>   MainHost : avro (source) ->  memory ->  rolling file (local FS)
>> >>>>>>>>>>>>> ...
>> >>>>>>>>>>>>> Host3 : exec ->  memory ->  avro (sink)
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Use your own exec command to read Apache log.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Guillaume Polaert | Cyrès Conseil
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> >>>>>>>>>>>>> Envoyé : mercredi 13 juin 2012 19:16
>> >>>>>>>>>>>>> À : flume-user@incubator.apache.org
>> >>>>>>>>>>>>> (mailto:flume-user@incubator.apache.org)
>> >>>>>>>>>>>>> Objet : Newbee question about flume 1.2 set up
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Good Afternoon,
>> >>>>>>>>>>>>> I am a newbee to flume and read thru limited documentation
>> >>>>>>>>>>>>> available.
>> >>>>>>>>>>>>> I
>> >>>>>>>>>>>>> would like to set up the following to test out.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> 1. Read apache access logs (as source)
>> >>>>>>>>>>>>> 2. Use memory channel
>> >>>>>>>>>>>>> 3. Write it to a NFS (or even local) file system
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Can some one help me with the necessary configuration. I am
>> >>>>>>>>>>>>> having
>> >>>>>>>>>>>>> difficult time to glean that information from available
>> >>>>>>>>>>>>> documentation.
>> >>>>>>>>>>>>> I am
>> >>>>>>>>>>>>> sure someone has done such test before and i appreciate if
>> >>>>>>>>>>>>> you
>> >>>>>>>>>>>>> can
>> >>>>>>>>>>>>> pass on
>> >>>>>>>>>>>>> that information. Secondly, I also would like to stream the
>> >>>>>>>>>>>>> logs
>> >>>>>>>>>>>>> to a
>> >>>>>>>>>>>>> remote server. Is that a log4j configuration or do i need to
>> >>>>>>>>>>>>> run
>> >>>>>>>>>>>>> an
>> >>>>>>>>>>>>> agent
>> >>>>>>>>>>>>> on each host to do so? Any configuration examples would be
>> >>>>>>>>>>>>> of
>> >>>>>>>>>>>>> great
>> >>>>>>>>>>>>> help.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>> Bhaskar
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Glad to hear that, Mohammad.  I am going to give it a try after building
from trunk.  What is your final configuration, if I may ask?

Bhaskar

On Wed, Jun 20, 2012 at 7:12 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Great work guys..I am really thankful to you all.. It's working perfectly
> fine.
>
> Regards,
>     Mohammad Tariq
>
>
> On Wed, Jun 20, 2012 at 9:10 AM, Juhani Connolly
> <ju...@cyberagent.co.jp> wrote:
> > That was just committed yesterday, so you will probably also need to pull
> > the new trunk and rebuild.
> >
> > Setting it up is easy, just add to your source:
> >
> > agent.sources.mysource.interceptors = hostint
> > agent.sources.mysource.interceptors.hostint.type =
> > org.apache.flume.interceptor.HostInterceptor$Builder
> > agent.sources.mysource.interceptors.hostint.preserveExisting = true
> > agent.sources.mysource.interceptors.hostint.useIP = false
> >
> > This will add a header at the original host(without overwriting it after
> > redirects), which can then be used in your path.
> >
> >
> > On 06/20/2012 11:59 AM, Mike Percy wrote:
> >>
> >> Will just contributed an Interceptor to provide this out of the box:
> >>
> >> https://issues.apache.org/jira/browse/FLUME-1284
> >>
> >> Regards
> >> Mike
> >>
> >>
> >> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
> >>
> >>> There is no problem from your side..Have a look at this -
> >>>
> >>>
> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
> >>>
> >>> Regards,
> >>> Mohammad Tariq
> >>>
> >>>
> >>> On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar<bmarthi@gmail.com
> >>> (mailto:bmarthi@gmail.com)>  wrote:
> >>>>
> >>>> Unfortunately, that part is not working as expected. Must be my
> mistake
> >>>> somewhere in the configuration. Here is my sink configuration.
> >>>>
> >>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
> >>>> agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> >>>> agent4.sinks.svc_0_sink.rollInterval=5400
> >>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >>>>
> >>>> Any thoughts on how to define host specific directory/file?
> >>>>
> >>>> Bhaskar
> >>>>
> >>>> On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq<dontariq@gmail.com
> >>>> (mailto:dontariq@gmail.com)>  wrote:
> >>>>>
> >>>>>
> >>>>> Hello Bhaskar,
> >>>>>
> >>>>> That's great...And we can use "%{host}" as the escape
> >>>>> sequence to prefix our filenames(am I getting you correctly???).And I
> >>>>> am waiting anxiously for your guide as I am still a newbie..:-)
> >>>>>
> >>>>> Regards,
> >>>>> Mohammad Tariq
> >>>>>
> >>>>>
> >>>>> On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar<bmarthi@gmail.com
> >>>>> (mailto:bmarthi@gmail.com)>  wrote:
> >>>>>>
> >>>>>> Thank you guys for the responses. I actually was able to get around
> >>>>>> this
> >>>>>> problem by tinkering around with my setting. I finally ended up
> with a
> >>>>>> capacity of 10000 and commented out transactionCapacity (i
> originally
> >>>>>> set it
> >>>>>> to 10) and it started working. Thanks for the insight. It took me a
> >>>>>> bit of
> >>>>>> time to figure out the inner workings of AVRO to get it to send data
> >>>>>> in
> >>>>>> correct format. So, i got over that hump:-). Here is my flow for
> POC.
> >>>>>>
> >>>>>> Host A agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc
> >>>>>> channel
> >>>>>> (flume-ng avro-client -H<<Host>>  -p<<port>>  -F<<file to read>>
> >>>>>> --conf
> >>>>>> ../conf/)
> >>>>>> Host B agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc
> >>>>>> channel
> >>>>>> (flume-ng avro-client -H<<Host>>  -p<<port>>  -F<<file to read>>
> >>>>>> --conf
> >>>>>> ../conf/)
> >>>>>> Host C agent -->  avro-collector source -->  file sink to local
> >>>>>> directory
> >>>>>> --
> >>>>>> Memory channel
> >>>>>>
> >>>>>> The issue i am running into is, I am unable to uniquely identify the
> >>>>>> source
> >>>>>> of the log in the sink (means the log events from Host A and Host B
> >>>>>> are
> >>>>>> combined into the same log on the disk and mixed up). Is there a way
> >>>>>> to
> >>>>>> provide unique identifier from the source so that we can track the
> >>>>>> origin of
> >>>>>> the log? I am hoping to see in my sink log,
> >>>>>>
> >>>>>> Host A -- some log entry
> >>>>>> Host B -- Some log entry etc
> >>>>>>
> >>>>>> Is this feasible or are there any alternative mechanisms to achieve
> >>>>>> this? I
> >>>>>> am putting together a new bee guide that might help answer some of
> >>>>>> these
> >>>>>> questions for others as i explore this architecture.
> >>>>>>
> >>>>>> As always thanks for your assistance,
> >>>>>> Bhaskar
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> >>>>>> <juhani_connolly@cyberagent.co.jp
> >>>>>> (mailto:juhani_connolly@cyberagent.co.jp)>  wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hello Bhaskar,
> >>>>>>>
> >>>>>>> Using Avro is generally the recommended method to handle multi-hop
> >>>>>>> flows,
> >>>>>>> so no concerns there.
> >>>>>>>
> >>>>>>> Have you tried this setup using memory channels instead of jdbc?
> Last
> >>>>>>> time
> >>>>>>> I tested it, the JDBC channel had poor throughput, so you may be
> >>>>>>> getting a
> >>>>>>> logjam somewhere. How much data is getting entered into your
> logfile?
> >>>>>>> Try
> >>>>>>> raising the capacity on your jdbc channel by a lot(10000?). With a
> >>>>>>> capacity
> >>>>>>> of 10, if the reading side(host b) isn't polling frequently enough,
> >>>>>>> there's
> >>>>>>> going to be problems. This is probably why you get the "failed to
> >>>>>>> persist
> >>>>>>> event". As far as FLUME-1259 is concerned, that should only be
> >>>>>>> happening if
> >>>>>>> bad data is being sent. You're not sending anything else to the
> same
> >>>>>>> port
> >>>>>>> are you? Make sure that only the source and sink are set to that
> port
> >>>>>>> and
> >>>>>>> that nothing else is.
> >>>>>>>
> >>>>>>> If the problem continues, please post a chunk of the logs leading
> up
> >>>>>>> to
> >>>>>>> the OOM error(the full trace for the cause should be enough)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 06/16/2012 12:01 AM, Bhaskar wrote:
> >>>>>>>
> >>>>>>> Sorry to be a pest with stream of questions. I think i am going two
> >>>>>>> steps
> >>>>>>> forward and four steps back:-). After my first successful attempt,
> i
> >>>>>>> tried
> >>>>>>> running flume with the following flow:
> >>>>>>>
> >>>>>>> 1. HostA
> >>>>>>> -- Source is tail web server log
> >>>>>>> -- channel jdbc
> >>>>>>> -- sink is AVRO collection on Host B
> >>>>>>> Configuraiton:
> >>>>>>> agent3.sources = tail
> >>>>>>> agent3.sinks = avro-forward-sink
> >>>>>>> agent3.channels = jdbc-channel
> >>>>>>>
> >>>>>>> # Define source flow
> >>>>>>> agent3.sources.tail.type = exec
> >>>>>>> agent3.sources.tail.command = tail -f /common/log/access.log
> >>>>>>> agent3.sources.tail.channels = jdbc-channel
> >>>>>>>
> >>>>>>> # define the flow
> >>>>>>> agent3.sinks.avro-forward-sink.channel = jdbc-channel
> >>>>>>>
> >>>>>>> # avro sink properties
> >>>>>>> agent3.sources.avro-forward-sink.type = avro
> >>>>>>> agent3.sources.avro-forward-sink.hostname =<<IP Address>>
> >>>>>>> agent3.sources.avro-forward-sink.port =<<PORT>>
> >>>>>>>
> >>>>>>> # Define channels
> >>>>>>> agent3.channels.jdbc-channel.type = jdbc
> >>>>>>> agent3.channels.jdbc-channel.maximum.capacity = 10
> >>>>>>> agent3.channels.jdbc-channel.maximum.connections = 2
> >>>>>>>
> >>>>>>>
> >>>>>>> 2. HostB
> >>>>>>> -- Source is AVRO collection
> >>>>>>> -- channel is memory
> >>>>>>> -- sink is local file system
> >>>>>>>
> >>>>>>> Configuration:
> >>>>>>> # list sources, sinks and channels in the agent4
> >>>>>>> agent4.sources = avro-collection-source
> >>>>>>> agent4.sinks = svc_0_sink
> >>>>>>> agent4.channels = MemoryChannel-2
> >>>>>>>
> >>>>>>> # define the flow
> >>>>>>> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> >>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >>>>>>>
> >>>>>>> # avro sink properties
> >>>>>>> agent4.sources.avro-collection-source.type = avro
> >>>>>>> agent4.sources.avro-collection-source.bind =<<IP Address>>
> >>>>>>> agent4.sources.avro-collection-source.port =<<PORT>>
> >>>>>>>
> >>>>>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
> >>>>>>> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> >>>>>>> agent4.sinks.svc_0_sink.rollInterval=600
> >>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >>>>>>>
> >>>>>>> agent4.channels.MemoryChannel-2.type = memory
> >>>>>>> agent4.channels.MemoryChannel-2.capacity = 100
> >>>>>>> agent4.channels.MemoryChannel-2.transactionCapacity = 10
> >>>>>>>
> >>>>>>>
> >>>>>>> Basically i am trying to tail a file on one host, stream it to
> >>>>>>> another
> >>>>>>> host running sink. During the trial run, the configuration is
> loaded
> >>>>>>> fine
> >>>>>>> and i see the channels created fine. I see an exception from the
> jdbc
> >>>>>>> channel first (Failed to persist event). I am getting a java heap
> >>>>>>> space OOM
> >>>>>>> exception from Host B when Host A attempts to write.
> >>>>>>>
> >>>>>>> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception
> >>>>>>> from
> >>>>>>> downstream.
> >>>>>>> java.lang.OutOfMemoryError: Java heap space
> >>>>>>>
> >>>>>>> This issue was already
> >>>>>>> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am
> >>>>>>> not
> >>>>>>> sure
> >>>>>>> if there is a work around to this problem. I have couple questions:
> >>>>>>>
> >>>>>>> 1. Am i force fitting a wrong solution here using AVRO?
> >>>>>>> 2. if so, what would be a right solution for streaming data from
> Host
> >>>>>>> A
> >>>>>>> to Host B (or thru intermediaries)?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Bhaskar
> >>>>>>>
> >>>>>>> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq<dontariq@gmail.com
> >>>>>>> (mailto:dontariq@gmail.com)>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Since you are thinking of using multi-hop flow I would suggest to
> go
> >>>>>>>> for "JDBC Channel" as there is higher chance of error than
> >>>>>>>> single-hop
> >>>>>>>> flow and in JDBC Channel events are stored in a persistent storage
> >>>>>>>> that’s backed by a database. For detailed guidelines you can refer
> >>>>>>>> Flume 1.x User Guide at -
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Mohammad Tariq
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar<bmarthi@gmail.com
> >>>>>>>> (mailto:bmarthi@gmail.com)>  wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Mohammad,
> >>>>>>>>> Thanks for the pointer there. Do you think using a message queue
> >>>>>>>>> (like
> >>>>>>>>> rabbitmq) would be a choice of communication channel between each
> >>>>>>>>> hop?
> >>>>>>>>> i am
> >>>>>>>>> struggling to get a handle on how i need to configure my sink in
> >>>>>>>>> intermediary hops in a multi-hop flow. Appreciate any
> >>>>>>>>> guidance/examples.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Bhaskar
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq<
> dontariq@gmail.com
> >>>>>>>>> (mailto:dontariq@gmail.com)>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hello Bhaskar,
> >>>>>>>>>>
> >>>>>>>>>> That's great..And the best approach to stream logs depends
> >>>>>>>>>> upon
> >>>>>>>>>> the type of source you want to watch for..And by looking at your
> >>>>>>>>>> usecase, I would suggest to go for "multi-hop" flows where
> events
> >>>>>>>>>> travel through multiple agents before reaching the final
> >>>>>>>>>> destination.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Mohammad Tariq
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar<bmarthi@gmail.com
> >>>>>>>>>> (mailto:bmarthi@gmail.com)>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I know what i am missing:-) I missed connecting the sink with
> >>>>>>>>>>> the
> >>>>>>>>>>> channel.
> >>>>>>>>>>> My small POC works now and i am able to view the streamed logs.
> >>>>>>>>>>> Thank
> >>>>>>>>>>> you
> >>>>>>>>>>> all for the guidance and patience in answering all questions.
> >>>>>>>>>>> So,
> >>>>>>>>>>> whats
> >>>>>>>>>>> the
> >>>>>>>>>>> best approach to stream logs from other hosts? Basically my
> next
> >>>>>>>>>>> task
> >>>>>>>>>>> would
> >>>>>>>>>>> be to set up collector (sort of) model to stream logs to
> >>>>>>>>>>> intermediary
> >>>>>>>>>>> and
> >>>>>>>>>>> then stream from collector to a sink location. I'd appreciate
> >>>>>>>>>>> any
> >>>>>>>>>>> thoughts/guidance in this regard.
> >>>>>>>>>>>
> >>>>>>>>>>> Bhaskar
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar<bmarthi@gmail.com
> >>>>>>>>>>> (mailto:bmarthi@gmail.com)>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> For testing purposes, I tried with the following configuration
> >>>>>>>>>>>> without
> >>>>>>>>>>>> much luck. I see that the process started fine but it just
> does
> >>>>>>>>>>>> not
> >>>>>>>>>>>> write
> >>>>>>>>>>>> anything to the sink. I guess i am missing something here. Can
> >>>>>>>>>>>> one of
> >>>>>>>>>>>> you
> >>>>>>>>>>>> gurus take a look and suggest what i am doing wrong?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Bhaskar
> >>>>>>>>>>>>
> >>>>>>>>>>>> agent1.sources = tail
> >>>>>>>>>>>> agent1.channels = MemoryChannel-2
> >>>>>>>>>>>> agent1.sinks = svc_0_sink
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> agent1.sources.tail.type = exec
> >>>>>>>>>>>> agent1.sources.tail.command = tail -f /var/log/access.log
> >>>>>>>>>>>> agent1.sources.tail.channels = MemoryChannel-2
> >>>>>>>>>>>>
> >>>>>>>>>>>> agent1.sinks.svc_0_sink.type = FILE_ROLL
> >>>>>>>>>>>> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> >>>>>>>>>>>> agent1.sinks.svc_0_sink.rollInterval=0
> >>>>>>>>>>>>
> >>>>>>>>>>>> agent1.channels.MemoryChannel-2.type = memory
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
> >>>>>>>>>>>> <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Bhaskar,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This is the flume.conf (http://pastebin.com/WULgUuaf) what
> I'm
> >>>>>>>>>>>>> using.
> >>>>>>>>>>>>> I have an avro server on the hadoop-m host and one agent per
> >>>>>>>>>>>>> node
> >>>>>>>>>>>>> (slave
> >>>>>>>>>>>>> hosts). Each agent send the ouput of a exec command to avro
> >>>>>>>>>>>>> server.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Host1 : exec ->  memory ->  avro (sink)
> >>>>>>>>>>>>> Host2 : exec ->  memory ->  avro (sink)   >>>>>   MainHost : avro (source) ->  memory ->  rolling file (local FS)
> >>>>>>>>>>>>> ...
> >>>>>>>>>>>>> Host3 : exec ->  memory ->  avro (sink)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Use your own exec command to read Apache log.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Guillaume Polaert | Cyrès Conseil
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> De : Bhaskar [mailto:bmarthi@gmail.com]
> >>>>>>>>>>>>> Envoyé : mercredi 13 juin 2012 19:16
> >>>>>>>>>>>>> À : flume-user@incubator.apache.org
> >>>>>>>>>>>>> (mailto:flume-user@incubator.apache.org)
> >>>>>>>>>>>>> Objet : Newbee question about flume 1.2 set up
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Good Afternoon,
> >>>>>>>>>>>>> I am a newbee to flume and read thru limited documentation
> >>>>>>>>>>>>> available.
> >>>>>>>>>>>>> I
> >>>>>>>>>>>>> would like to set up the following to test out.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. Read apache access logs (as source)
> >>>>>>>>>>>>> 2. Use memory channel
> >>>>>>>>>>>>> 3. Write it to a NFS (or even local) file system
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Can some one help me with the necessary configuration. I am
> >>>>>>>>>>>>> having
> >>>>>>>>>>>>> difficult time to glean that information from available
> >>>>>>>>>>>>> documentation.
> >>>>>>>>>>>>> I am
> >>>>>>>>>>>>> sure someone has done such test before and i appreciate if
> you
> >>>>>>>>>>>>> can
> >>>>>>>>>>>>> pass on
> >>>>>>>>>>>>> that information. Secondly, I also would like to stream the
> >>>>>>>>>>>>> logs
> >>>>>>>>>>>>> to a
> >>>>>>>>>>>>> remote server. Is that a log4j configuration or do i need to
> >>>>>>>>>>>>> run
> >>>>>>>>>>>>> an
> >>>>>>>>>>>>> agent
> >>>>>>>>>>>>> on each host to do so? Any configuration examples would be of
> >>>>>>>>>>>>> great
> >>>>>>>>>>>>> help.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Bhaskar
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >>
> >
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Great work guys..I am really thankful to you all.. It's working perfectly fine.

Regards,
    Mohammad Tariq


On Wed, Jun 20, 2012 at 9:10 AM, Juhani Connolly
<ju...@cyberagent.co.jp> wrote:
> That was just committed yesterday, so you will probably also need to pull
> the new trunk and rebuild.
>
> Setting it up is easy, just add to your source:
>
> agent.sources.mysource.interceptors = hostint
> agent.sources.mysource.interceptors.hostint.type =
> org.apache.flume.interceptor.HostInterceptor$Builder
> agent.sources.mysource.interceptors.hostint.preserveExisting = true
> agent.sources.mysource.interceptors.hostint.useIP = false
>
> This will add a header at the original host(without overwriting it after
> redirects), which can then be used in your path.
>
>
> On 06/20/2012 11:59 AM, Mike Percy wrote:
>>
>> Will just contributed an Interceptor to provide this out of the box:
>>
>> https://issues.apache.org/jira/browse/FLUME-1284
>>
>> Regards
>> Mike
>>
>>
>> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
>>
>>> There is no problem from your side..Have a look at this -
>>>
>>> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
>>>
>>> Regards,
>>> Mohammad Tariq
>>>
>>>
>>> On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar<bmarthi@gmail.com
>>> (mailto:bmarthi@gmail.com)>  wrote:
>>>>
>>>> Unfortunately, that part is not working as expected. Must be my mistake
>>>> somewhere in the configuration. Here is my sink configuration.
>>>>
>>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>>>> agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
>>>> agent4.sinks.svc_0_sink.rollInterval=5400
>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>>>
>>>> Any thoughts on how to define host specific directory/file?
>>>>
>>>> Bhaskar
>>>>
>>>> On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq<dontariq@gmail.com
>>>> (mailto:dontariq@gmail.com)>  wrote:
>>>>>
>>>>>
>>>>> Hello Bhaskar,
>>>>>
>>>>> That's great...And we can use "%{host}" as the escape
>>>>> sequence to prefix our filenames(am I getting you correctly???).And I
>>>>> am waiting anxiously for your guide as I am still a newbie..:-)
>>>>>
>>>>> Regards,
>>>>> Mohammad Tariq
>>>>>
>>>>>
>>>>> On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar<bmarthi@gmail.com
>>>>> (mailto:bmarthi@gmail.com)>  wrote:
>>>>>>
>>>>>> Thank you guys for the responses. I actually was able to get around
>>>>>> this
>>>>>> problem by tinkering around with my setting. I finally ended up with a
>>>>>> capacity of 10000 and commented out transactionCapacity (i originally
>>>>>> set it
>>>>>> to 10) and it started working. Thanks for the insight. It took me a
>>>>>> bit of
>>>>>> time to figure out the inner workings of AVRO to get it to send data
>>>>>> in
>>>>>> correct format. So, i got over that hump:-). Here is my flow for POC.
>>>>>>
>>>>>> Host A agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc
>>>>>> channel
>>>>>> (flume-ng avro-client -H<<Host>>  -p<<port>>  -F<<file to read>>
>>>>>> --conf
>>>>>> ../conf/)
>>>>>> Host B agent -->  Source tail exec -->  AVRO Client Sink -->  jdbc
>>>>>> channel
>>>>>> (flume-ng avro-client -H<<Host>>  -p<<port>>  -F<<file to read>>
>>>>>> --conf
>>>>>> ../conf/)
>>>>>> Host C agent -->  avro-collector source -->  file sink to local
>>>>>> directory
>>>>>> --
>>>>>> Memory channel
>>>>>>
>>>>>> The issue i am running into is, I am unable to uniquely identify the
>>>>>> source
>>>>>> of the log in the sink (means the log events from Host A and Host B
>>>>>> are
>>>>>> combined into the same log on the disk and mixed up). Is there a way
>>>>>> to
>>>>>> provide unique identifier from the source so that we can track the
>>>>>> origin of
>>>>>> the log? I am hoping to see in my sink log,
>>>>>>
>>>>>> Host A -- some log entry
>>>>>> Host B -- Some log entry etc
>>>>>>
>>>>>> Is this feasible or are there any alternative mechanisms to achieve
>>>>>> this? I
>>>>>> am putting together a new bee guide that might help answer some of
>>>>>> these
>>>>>> questions for others as i explore this architecture.
>>>>>>
>>>>>> As always thanks for your assistance,
>>>>>> Bhaskar
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
>>>>>> <juhani_connolly@cyberagent.co.jp
>>>>>> (mailto:juhani_connolly@cyberagent.co.jp)>  wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hello Bhaskar,
>>>>>>>
>>>>>>> Using Avro is generally the recommended method to handle multi-hop
>>>>>>> flows,
>>>>>>> so no concerns there.
>>>>>>>
>>>>>>> Have you tried this setup using memory channels instead of jdbc? Last
>>>>>>> time
>>>>>>> I tested it, the JDBC channel had poor throughput, so you may be
>>>>>>> getting a
>>>>>>> logjam somewhere. How much data is getting entered into your logfile?
>>>>>>> Try
>>>>>>> raising the capacity on your jdbc channel by a lot(10000?). With a
>>>>>>> capacity
>>>>>>> of 10, if the reading side(host b) isn't polling frequently enough,
>>>>>>> there's
>>>>>>> going to be problems. This is probably why you get the "failed to
>>>>>>> persist
>>>>>>> event". As far as FLUME-1259 is concerned, that should only be
>>>>>>> happening if
>>>>>>> bad data is being sent. You're not sending anything else to the same
>>>>>>> port
>>>>>>> are you? Make sure that only the source and sink are set to that port
>>>>>>> and
>>>>>>> that nothing else is.
>>>>>>>
>>>>>>> If the problem continues, please post a chunk of the logs leading up
>>>>>>> to
>>>>>>> the OOM error(the full trace for the cause should be enough)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 06/16/2012 12:01 AM, Bhaskar wrote:
>>>>>>>
>>>>>>> Sorry to be a pest with stream of questions. I think i am going two
>>>>>>> steps
>>>>>>> forward and four steps back:-). After my first successful attempt, i
>>>>>>> tried
>>>>>>> running flume with the following flow:
>>>>>>>
>>>>>>> 1. HostA
>>>>>>> -- Source is tail web server log
>>>>>>> -- channel jdbc
>>>>>>> -- sink is AVRO collection on Host B
>>>>>>> Configuraiton:
>>>>>>> agent3.sources = tail
>>>>>>> agent3.sinks = avro-forward-sink
>>>>>>> agent3.channels = jdbc-channel
>>>>>>>
>>>>>>> # Define source flow
>>>>>>> agent3.sources.tail.type = exec
>>>>>>> agent3.sources.tail.command = tail -f /common/log/access.log
>>>>>>> agent3.sources.tail.channels = jdbc-channel
>>>>>>>
>>>>>>> # define the flow
>>>>>>> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>>>>>>>
>>>>>>> # avro sink properties
>>>>>>> agent3.sources.avro-forward-sink.type = avro
>>>>>>> agent3.sources.avro-forward-sink.hostname =<<IP Address>>
>>>>>>> agent3.sources.avro-forward-sink.port =<<PORT>>
>>>>>>>
>>>>>>> # Define channels
>>>>>>> agent3.channels.jdbc-channel.type = jdbc
>>>>>>> agent3.channels.jdbc-channel.maximum.capacity = 10
>>>>>>> agent3.channels.jdbc-channel.maximum.connections = 2
>>>>>>>
>>>>>>>
>>>>>>> 2. HostB
>>>>>>> -- Source is AVRO collection
>>>>>>> -- channel is memory
>>>>>>> -- sink is local file system
>>>>>>>
>>>>>>> Configuration:
>>>>>>> # list sources, sinks and channels in the agent4
>>>>>>> agent4.sources = avro-collection-source
>>>>>>> agent4.sinks = svc_0_sink
>>>>>>> agent4.channels = MemoryChannel-2
>>>>>>>
>>>>>>> # define the flow
>>>>>>> agent4.sources.avro-collection-source.channels = MemoryChannel-2
>>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>>>>>>
>>>>>>> # avro sink properties
>>>>>>> agent4.sources.avro-collection-source.type = avro
>>>>>>> agent4.sources.avro-collection-source.bind =<<IP Address>>
>>>>>>> agent4.sources.avro-collection-source.port =<<PORT>>
>>>>>>>
>>>>>>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>>>>>>> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
>>>>>>> agent4.sinks.svc_0_sink.rollInterval=600
>>>>>>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>>>>>>
>>>>>>> agent4.channels.MemoryChannel-2.type = memory
>>>>>>> agent4.channels.MemoryChannel-2.capacity = 100
>>>>>>> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>>>>>>>
>>>>>>>
>>>>>>> Basically i am trying to tail a file on one host, stream it to
>>>>>>> another
>>>>>>> host running sink. During the trial run, the configuration is loaded
>>>>>>> fine
>>>>>>> and i see the channels created fine. I see an exception from the jdbc
>>>>>>> channel first (Failed to persist event). I am getting a java heap
>>>>>>> space OOM
>>>>>>> exception from Host B when Host A attempts to write.
>>>>>>>
>>>>>>> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception
>>>>>>> from
>>>>>>> downstream.
>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>
>>>>>>> This issue was already
>>>>>>> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am
>>>>>>> not
>>>>>>> sure
>>>>>>> if there is a work around to this problem. I have couple questions:
>>>>>>>
>>>>>>> 1. Am i force fitting a wrong solution here using AVRO?
>>>>>>> 2. if so, what would be a right solution for streaming data from Host
>>>>>>> A
>>>>>>> to Host B (or thru intermediaries)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bhaskar
>>>>>>>
>>>>>>> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq<dontariq@gmail.com
>>>>>>> (mailto:dontariq@gmail.com)>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Since you are thinking of using multi-hop flow I would suggest to go
>>>>>>>> for "JDBC Channel" as there is higher chance of error than
>>>>>>>> single-hop
>>>>>>>> flow and in JDBC Channel events are stored in a persistent storage
>>>>>>>> that’s backed by a database. For detailed guidelines you can refer
>>>>>>>> Flume 1.x User Guide at -
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mohammad Tariq
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar<bmarthi@gmail.com
>>>>>>>> (mailto:bmarthi@gmail.com)>  wrote:
>>>>>>>>>
>>>>>>>>> Hi Mohammad,
>>>>>>>>> Thanks for the pointer there. Do you think using a message queue
>>>>>>>>> (like
>>>>>>>>> rabbitmq) would be a choice of communication channel between each
>>>>>>>>> hop?
>>>>>>>>> i am
>>>>>>>>> struggling to get a handle on how i need to configure my sink in
>>>>>>>>> intermediary hops in a multi-hop flow. Appreciate any
>>>>>>>>> guidance/examples.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Bhaskar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq<dontariq@gmail.com
>>>>>>>>> (mailto:dontariq@gmail.com)>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hello Bhaskar,
>>>>>>>>>>
>>>>>>>>>> That's great..And the best approach to stream logs depends
>>>>>>>>>> upon
>>>>>>>>>> the type of source you want to watch for..And by looking at your
>>>>>>>>>> usecase, I would suggest to go for "multi-hop" flows where events
>>>>>>>>>> travel through multiple agents before reaching the final
>>>>>>>>>> destination.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Mohammad Tariq
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar<bmarthi@gmail.com
>>>>>>>>>> (mailto:bmarthi@gmail.com)>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I know what i am missing:-) I missed connecting the sink with
>>>>>>>>>>> the
>>>>>>>>>>> channel.
>>>>>>>>>>> My small POC works now and i am able to view the streamed logs.
>>>>>>>>>>> Thank
>>>>>>>>>>> you
>>>>>>>>>>> all for the guidance and patience in answering all questions.
>>>>>>>>>>> So,
>>>>>>>>>>> whats
>>>>>>>>>>> the
>>>>>>>>>>> best approach to stream logs from other hosts? Basically my next
>>>>>>>>>>> task
>>>>>>>>>>> would
>>>>>>>>>>> be to set up collector (sort of) model to stream logs to
>>>>>>>>>>> intermediary
>>>>>>>>>>> and
>>>>>>>>>>> then stream from collector to a sink location. I'd appreciate
>>>>>>>>>>> any
>>>>>>>>>>> thoughts/guidance in this regard.
>>>>>>>>>>>
>>>>>>>>>>> Bhaskar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar<bmarthi@gmail.com
>>>>>>>>>>> (mailto:bmarthi@gmail.com)>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> For testing purposes, I tried with the following configuration
>>>>>>>>>>>> without
>>>>>>>>>>>> much luck. I see that the process started fine but it just does
>>>>>>>>>>>> not
>>>>>>>>>>>> write
>>>>>>>>>>>> anything to the sink. I guess i am missing something here. Can
>>>>>>>>>>>> one of
>>>>>>>>>>>> you
>>>>>>>>>>>> gurus take a look and suggest what i am doing wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>> agent1.sources = tail
>>>>>>>>>>>> agent1.channels = MemoryChannel-2
>>>>>>>>>>>> agent1.sinks = svc_0_sink
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> agent1.sources.tail.type = exec
>>>>>>>>>>>> agent1.sources.tail.command = tail -f /var/log/access.log
>>>>>>>>>>>> agent1.sources.tail.channels = MemoryChannel-2
>>>>>>>>>>>>
>>>>>>>>>>>> agent1.sinks.svc_0_sink.type = FILE_ROLL
>>>>>>>>>>>> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>>>>>>>>>>>> agent1.sinks.svc_0_sink.rollInterval=0
>>>>>>>>>>>>
>>>>>>>>>>>> agent1.channels.MemoryChannel-2.type = memory
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>>>>>>>>>>>> <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Bhaskar,
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>>>>>>>>>>>>> using.
>>>>>>>>>>>>> I have an avro server on the hadoop-m host and one agent per
>>>>>>>>>>>>> node
>>>>>>>>>>>>> (slave
>>>>>>>>>>>>> hosts). Each agent send the ouput of a exec command to avro
>>>>>>>>>>>>> server.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host1 : exec ->  memory ->  avro (sink)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host2 : exec ->  memory ->  avro
>>>>>>>>>>>>> MainHost :
>>>>>>>>>>>>> avro
>>>>>>>>>>>>> (source) ->  memory ->  rolling file (local FS)
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Host3 : exec ->  memory ->  avro
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Use your own exec command to read Apache log.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guillaume Polaert | Cyrès Conseil
>>>>>>>>>>>>>
>>>>>>>>>>>>> De : Bhaskar [mailto:bmarthi@gmail.com]
>>>>>>>>>>>>> Envoyé : mercredi 13 juin 2012 19:16
>>>>>>>>>>>>> À : flume-user@incubator.apache.org
>>>>>>>>>>>>> (mailto:flume-user@incubator.apache.org)
>>>>>>>>>>>>> Objet : Newbee question about flume 1.2 set up
>>>>>>>>>>>>>
>>>>>>>>>>>>> Good Afternoon,
>>>>>>>>>>>>> I am a newbee to flume and read thru limited documentation
>>>>>>>>>>>>> available.
>>>>>>>>>>>>> I
>>>>>>>>>>>>> would like to set up the following to test out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Read apache access logs (as source)
>>>>>>>>>>>>> 2. Use memory channel
>>>>>>>>>>>>> 3. Write it to a NFS (or even local) file system
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can some one help me with the necessary configuration. I am
>>>>>>>>>>>>> having
>>>>>>>>>>>>> difficult time to glean that information from available
>>>>>>>>>>>>> documentation.
>>>>>>>>>>>>> I am
>>>>>>>>>>>>> sure someone has done such test before and i appreciate if you
>>>>>>>>>>>>> can
>>>>>>>>>>>>> pass on
>>>>>>>>>>>>> that information. Secondly, I also would like to stream the
>>>>>>>>>>>>> logs
>>>>>>>>>>>>> to a
>>>>>>>>>>>>> remote server. Is that a log4j configuration or do i need to
>>>>>>>>>>>>> run
>>>>>>>>>>>>> an
>>>>>>>>>>>>> agent
>>>>>>>>>>>>> on each host to do so? Any configuration examples would be of
>>>>>>>>>>>>> great
>>>>>>>>>>>>> help.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Bhaskar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>

Re: Newbee question about flume 1.2 set up

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
That was just committed yesterday, so you will probably also need to 
pull the new trunk and rebuild.

Setting it up is easy, just add to your source:

agent.sources.mysource.interceptors = hostint
agent.sources.mysource.interceptors.hostint.type = org.apache.flume.interceptor.HostInterceptor$Builder
agent.sources.mysource.interceptors.hostint.preserveExisting = true
agent.sources.mysource.interceptors.hostint.useIP = false

This will add a header at the original host (without overwriting it after
redirects), which can then be used in your path.
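
For example, a minimal sketch of using that header in a sink path (illustrative
only; this assumes the HDFS sink, whose hdfs.path expands escape sequences, and
the agent, sink and channel names below are made up):

# hypothetical HDFS sink writing each host's events under its own directory,
# keyed off the "host" header set by the interceptor above
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://namenode/flume/events/%{host}
agent.sinks.hdfssink.channel = memchannel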

On 06/20/2012 11:59 AM, Mike Percy wrote:
> Will just contributed an Interceptor to provide this out of the box:
>
> https://issues.apache.org/jira/browse/FLUME-1284
>
> Regards
> Mike


Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Yes, and I will open a ticket with that request.

Thanks,
Bhaskar

On Wed, Jun 20, 2012 at 1:12 PM, Will McQueen <wi...@cloudera.com> wrote:

> Hi Bhaskar,
>
> Are you requesting that the RollingFileSink be able to construct the file
> name according to escape sequences like what can be done with
> HDFSEventSink? Please open a ticket with your request.
>
> Cheers,
> Will
>
>
> On Jun 20, 2012, at 9:04 AM, Bhaskar <bm...@gmail.com> wrote:
>
> I wish similar patch got applied to RollingFileSink as well.  Currently
> RollingFileSink does not reconstruct the path based on place holder values.
>  This patch was done to HDFSEventSink.  Do you think i should request that
> thru Jira?
>
> Bhaskar
>

Re: Newbee question about flume 1.2 set up

Posted by Will McQueen <wi...@cloudera.com>.
Hi Bhaskar,

Are you requesting that the RollingFileSink be able to construct the file name according to escape sequences, as can be done with HDFSEventSink? Please open a ticket with your request.
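
For reference, a rough sketch of the kind of escaping HDFSEventSink already
supports, next to the literal directory that FILE_ROLL takes today (names are
illustrative, and the %Y/%m/%d escapes assume the events carry a timestamp
header):

# hypothetical HDFS sink: path built from a header plus time escapes
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://namenode/logs/%{host}/%Y/%m/%d
agent.sinks.hdfssink.channel = memchannel

# FILE_ROLL today: sink.directory is taken literally, with no escape expansion
agent.sinks.filesink.type = FILE_ROLL
agent.sinks.filesink.sink.directory = /var/logs/agent4
agent.sinks.filesink.channel = memchannel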

Cheers,
Will

On Jun 20, 2012, at 9:04 AM, Bhaskar <bm...@gmail.com> wrote:

> I wish similar patch got applied to RollingFileSink as well.  Currently RollingFileSink does not reconstruct the path based on place holder values.  This patch was done to HDFSEventSink.  Do you think i should request that thru Jira?
> 
> Bhaskar
> 
> On Tue, Jun 19, 2012 at 10:59 PM, Mike Percy <mp...@cloudera.com> wrote:
> Will just contributed an Interceptor to provide this out of the box:
> 
> https://issues.apache.org/jira/browse/FLUME-1284
> 
> Regards
> Mike
> 
> 
> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
> 
> > There is no problem from your side..Have a look at this -
> > http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
> >
> > Regards,
> > Mohammad Tariq
> >
> >
> > On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > Unfortunately, that part is not working as expected. Must be my mistake
> > > somewhere in the configuration. Here is my sink configuration.
> > >
> > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> > > agent4.sinks.svc_0_sink.rollInterval=5400
> > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > >
> > > Any thoughts on how to define host specific directory/file?
> > >
> > > Bhaskar
> > >
> > > On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)> wrote:
> > > >
> > > > Hello Bhaskar,
> > > >
> > > > That's great...And we can use "%{host}" as the escape
> > > > sequence to prefix our filenames(am I getting you correctly???).And I
> > > > am waiting anxiously for your guide as I am still a newbie..:-)
> > > >
> > > > Regards,
> > > > Mohammad Tariq
> > > >
> > > >
> > > > On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > > > Thank you guys for the responses. I actually was able to get around
> > > > > this
> > > > > problem by tinkering around with my setting. I finally ended up with a
> > > > > capacity of 10000 and commented out transactionCapacity (i originally
> > > > > set it
> > > > > to 10) and it started working. Thanks for the insight. It took me a
> > > > > bit of
> > > > > time to figure out the inner workings of AVRO to get it to send data in
> > > > > correct format. So, i got over that hump:-). Here is my flow for POC.
> > > > >
> > > > > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > > --conf
> > > > > ../conf/)
> > > > > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > > --conf
> > > > > ../conf/)
> > > > > Host C agent --> avro-collector source --> file sink to local directory
> > > > > --
> > > > > Memory channel
> > > > >
> > > > > The issue i am running into is, I am unable to uniquely identify the
> > > > > source
> > > > > of the log in the sink (means the log events from Host A and Host B are
> > > > > combined into the same log on the disk and mixed up). Is there a way to
> > > > > provide unique identifier from the source so that we can track the
> > > > > origin of
> > > > > the log? I am hoping to see in my sink log,
> > > > >
> > > > > Host A -- some log entry
> > > > > Host B -- Some log entry etc
> > > > >
> > > > > Is this feasible or are there any alternative mechanisms to achieve
> > > > > this? I
> > > > > am putting together a new bee guide that might help answer some of these
> > > > > questions for others as i explore this architecture.
> > > > >
> > > > > As always thanks for your assistance,
> > > > > Bhaskar
> > > > >
> > > > >
> > > > > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> > > > > <juhani_connolly@cyberagent.co.jp (mailto:juhani_connolly@cyberagent.co.jp)> wrote:
> > > > > >
> > > > > > Hello Bhaskar,
> > > > > >
> > > > > > Using Avro is generally the recommended method to handle multi-hop
> > > > > > flows,
> > > > > > so no concerns there.
> > > > > >
> > > > > > Have you tried this setup using memory channels instead of jdbc? Last
> > > > > > time
> > > > > > I tested it, the JDBC channel had poor throughput, so you may be
> > > > > > getting a
> > > > > > logjam somewhere. How much data is getting entered into your logfile?
> > > > > > Try
> > > > > > raising the capacity on your jdbc channel by a lot(10000?). With a
> > > > > > capacity
> > > > > > of 10, if the reading side(host b) isn't polling frequently enough,
> > > > > > there's
> > > > > > going to be problems. This is probably why you get the "failed to
> > > > > > persist
> > > > > > event". As far as FLUME-1259 is concerned, that should only be
> > > > > > happening if
> > > > > > bad data is being sent. You're not sending anything else to the same
> > > > > > port
> > > > > > are you? Make sure that only the source and sink are set to that port
> > > > > > and
> > > > > > that nothing else is.
> > > > > >
> > > > > > If the problem continues, please post a chunk of the logs leading up to
> > > > > > the OOM error(the full trace for the cause should be enough)
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 06/16/2012 12:01 AM, Bhaskar wrote:
> > > > > >
> > > > > > Sorry to be a pest with stream of questions. I think i am going two
> > > > > > steps
> > > > > > forward and four steps back:-). After my first successful attempt, i
> > > > > > tried
> > > > > > running flume with the following flow:
> > > > > >
> > > > > > 1. HostA
> > > > > > -- Source is tail web server log
> > > > > > -- channel jdbc
> > > > > > -- sink is AVRO collection on Host B
> > > > > > Configuraiton:
> > > > > > agent3.sources = tail
> > > > > > agent3.sinks = avro-forward-sink
> > > > > > agent3.channels = jdbc-channel
> > > > > >
> > > > > > # Define source flow
> > > > > > agent3.sources.tail.type = exec
> > > > > > agent3.sources.tail.command = tail -f /common/log/access.log
> > > > > > agent3.sources.tail.channels = jdbc-channel
> > > > > >
> > > > > > # define the flow
> > > > > > agent3.sinks.avro-forward-sink.channel = jdbc-channel
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent3.sources.avro-forward-sink.type = avro
> > > > > > agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> > > > > > agent3.sources.avro-forward-sink.port = <<PORT>>
> > > > > >
> > > > > > # Define channels
> > > > > > agent3.channels.jdbc-channel.type = jdbc
> > > > > > agent3.channels.jdbc-channel.maximum.capacity = 10
> > > > > > agent3.channels.jdbc-channel.maximum.connections = 2
> > > > > >
> > > > > >
> > > > > > 2. HostB
> > > > > > -- Source is AVRO collection
> > > > > > -- channel is memory
> > > > > > -- sink is local file system
> > > > > >
> > > > > > Configuration:
> > > > > > # list sources, sinks and channels in the agent4
> > > > > > agent4.sources = avro-collection-source
> > > > > > agent4.sinks = svc_0_sink
> > > > > > agent4.channels = MemoryChannel-2
> > > > > >
> > > > > > # define the flow
> > > > > > agent4.sources.avro-collection-source.channels = MemoryChannel-2
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent4.sources.avro-collection-source.type = avro
> > > > > > agent4.sources.avro-collection-source.bind = <<IP Address>>
> > > > > > agent4.sources.avro-collection-source.port = <<PORT>>
> > > > > >
> > > > > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> > > > > > agent4.sinks.svc_0_sink.rollInterval=600
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > agent4.channels.MemoryChannel-2.type = memory
> > > > > > agent4.channels.MemoryChannel-2.capacity = 100
> > > > > > agent4.channels.MemoryChannel-2.transactionCapacity = 10
> > > > > >
> > > > > >
> > > > > > Basically i am trying to tail a file on one host, stream it to another
> > > > > > host running sink. During the trial run, the configuration is loaded
> > > > > > fine
> > > > > > and i see the channels created fine. I see an exception from the jdbc
> > > > > > channel first (Failed to persist event). I am getting a java heap
> > > > > > space OOM
> > > > > > exception from Host B when Host A attempts to write.
> > > > > >
> > > > > > 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
> > > > > > downstream.
> > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > >
> > > > > > This issue was already
> > > > > > reported https://issues.apache.org/jira/browse/FLUME-1259 but i am not
> > > > > > sure
> > > > > > if there is a work around to this problem. I have couple questions:
> > > > > >
> > > > > > 1. Am i force fitting a wrong solution here using AVRO?
> > > > > > 2. if so, what would be a right solution for streaming data from Host
> > > > > > A
> > > > > > to Host B (or thru intermediaries)?
> > > > > >
> > > > > > Thanks,
> > > > > > Bhaskar
> > > > > >
> > > > > > On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > > wrote:
> > > > > > >
> > > > > > > Since you are thinking of using multi-hop flow I would suggest to go
> > > > > > > for "JDBC Channel" as there is higher chance of error than single-hop
> > > > > > > flow and in JDBC Channel events are stored in a persistent storage
> > > > > > > that’s backed by a database. For detailed guidelines you can refer
> > > > > > > Flume 1.x User Guide at -
> > > > > > >
> > > > > > >
> > > > > > > https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> > > > > > >
> > > > > > > Regards,
> > > > > > > Mohammad Tariq
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > > > > > > Hi Mohammad,
> > > > > > > > Thanks for the pointer there. Do you think using a message queue
> > > > > > > > (like
> > > > > > > > rabbitmq) would be a choice of communication channel between each
> > > > > > > > hop?
> > > > > > > > i am
> > > > > > > > struggling to get a handle on how i need to configure my sink in
> > > > > > > > intermediary hops in a multi-hop flow. Appreciate any
> > > > > > > > guidance/examples.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bhaskar
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hello Bhaskar,
> > > > > > > > >
> > > > > > > > > That's great..And the best approach to stream logs depends
> > > > > > > > > upon
> > > > > > > > > the type of source you want to watch for..And by looking at your
> > > > > > > > > usecase, I would suggest to go for "multi-hop" flows where events
> > > > > > > > > travel through multiple agents before reaching the final
> > > > > > > > > destination.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Mohammad Tariq
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > > wrote:
> > > > > > > > > > I know what i am missing:-) I missed connecting the sink with
> > > > > > > > > > the
> > > > > > > > > > channel.
> > > > > > > > > > My small POC works now and i am able to view the streamed logs.
> > > > > > > > > > Thank
> > > > > > > > > > you
> > > > > > > > > > all for the guidance and patience in answering all questions.
> > > > > > > > > > So,
> > > > > > > > > > whats
> > > > > > > > > > the
> > > > > > > > > > best approach to stream logs from other hosts? Basically my next
> > > > > > > > > > task
> > > > > > > > > > would
> > > > > > > > > > be to set up collector (sort of) model to stream logs to
> > > > > > > > > > intermediary
> > > > > > > > > > and
> > > > > > > > > > then stream from collector to a sink location. I'd appreciate
> > > > > > > > > > any
> > > > > > > > > > thoughts/guidance in this regard.
> > > > > > > > > >
> > > > > > > > > > Bhaskar
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > For testing purposes, I tried with the following configuration
> > > > > > > > > > > without
> > > > > > > > > > > much luck. I see that the process started fine but it just does
> > > > > > > > > > > not
> > > > > > > > > > > write
> > > > > > > > > > > anything to the sink. I guess i am missing something here. Can
> > > > > > > > > > > one of
> > > > > > > > > > > you
> > > > > > > > > > > gurus take a look and suggest what i am doing wrong?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Bhaskar
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources = tail
> > > > > > > > > > > agent1.channels = MemoryChannel-2
> > > > > > > > > > > agent1.sinks = svc_0_sink
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources.tail.type = exec
> > > > > > > > > > > agent1.sources.tail.command = tail -f /var/log/access.log
> > > > > > > > > > > agent1.sources.tail.channels = MemoryChannel-2
> > > > > > > > > > >
> > > > > > > > > > > agent1.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > > > > > > agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> > > > > > > > > > > agent1.sinks.svc_0_sink.rollInterval=0
> > > > > > > > > > >
> > > > > > > > > > > agent1.channels.MemoryChannel-2.type = memory
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
> > > > > > > > > > > <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Bhaskar,
> > > > > > > > > > > >
> > > > > > > > > > > > This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
> > > > > > > > > > > > using.
> > > > > > > > > > > > I have an avro server on the hadoop-m host and one agent per
> > > > > > > > > > > > node
> > > > > > > > > > > > (slave
> > > > > > > > > > > > hosts). Each agent send the ouput of a exec command to avro
> > > > > > > > > > > > server.
> > > > > > > > > > > >
> > > > > > > > > > > > Host1 : exec -> memory -> avro (sink)
> > > > > > > > > > > >
> > > > > > > > > > > > Host2 : exec -> memory -> avro
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > MainHost :
> > > > > > > > > > > > avro
> > > > > > > > > > > > (source) -> memory -> rolling file (local FS)
> > > > > > > > > > > > ...
> > > > > > > > > > > >
> > > > > > > > > > > > Host3 : exec -> memory -> avro
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Use your own exec command to read Apache log.
> > > > > > > > > > > >
> > > > > > > > > > > > Guillaume Polaert | Cyrès Conseil
> > > > > > > > > > > >
> > > > > > > > > > > > De : Bhaskar [mailto:bmarthi@gmail.com]
> > > > > > > > > > > > Envoyé : mercredi 13 juin 2012 19:16
> > > > > > > > > > > > À : flume-user@incubator.apache.org (mailto:flume-user@incubator.apache.org)
> > > > > > > > > > > > Objet : Newbee question about flume 1.2 set up
> > > > > > > > > > > >
> > > > > > > > > > > > Good Afternoon,
> > > > > > > > > > > > I am a newbee to flume and read thru limited documentation
> > > > > > > > > > > > available.
> > > > > > > > > > > > I
> > > > > > > > > > > > would like to set up the following to test out.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Read apache access logs (as source)
> > > > > > > > > > > > 2. Use memory channel
> > > > > > > > > > > > 3. Write it to a NFS (or even local) file system
> > > > > > > > > > > >
> > > > > > > > > > > > Can some one help me with the necessary configuration. I am
> > > > > > > > > > > > having
> > > > > > > > > > > > difficult time to glean that information from available
> > > > > > > > > > > > documentation.
> > > > > > > > > > > > I am
> > > > > > > > > > > > sure someone has done such test before and i appreciate if you
> > > > > > > > > > > > can
> > > > > > > > > > > > pass on
> > > > > > > > > > > > that information. Secondly, I also would like to stream the
> > > > > > > > > > > > logs
> > > > > > > > > > > > to a
> > > > > > > > > > > > remote server. Is that a log4j configuration or do i need to
> > > > > > > > > > > > run
> > > > > > > > > > > > an
> > > > > > > > > > > > agent
> > > > > > > > > > > > on each host to do so? Any configuration examples would be of
> > > > > > > > > > > > great
> > > > > > > > > > > > help.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Bhaskar
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 
> 
> 
> 

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
I wish a similar patch were applied to RollingFileSink as well.  Currently
RollingFileSink does not reconstruct the path based on placeholder values;
that patch was applied only to HDFSEventSink.  Do you think I should request
this through Jira?

Bhaskar
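
For comparison, here is a rough sketch of what HDFSEventSink already allows,
since it expands escape sequences such as %{host} and date patterns from event
headers when building hdfs.path (the namenode address and paths below are only
placeholders):

# %{host} needs a host header on the event; %Y/%m/%d need a timestamp header
agent4.sinks.hdfs_sink.type = hdfs
agent4.sinks.hdfs_sink.hdfs.path = hdfs://namenode:8020/logs/%{host}/%Y-%m-%d
agent4.sinks.hdfs_sink.hdfs.filePrefix = access
agent4.sinks.hdfs_sink.channel = MemoryChannel-2

FILE_ROLL, by contrast, treats sink.directory as a literal path, which is why
the placeholder is never substituted there.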

On Tue, Jun 19, 2012 at 10:59 PM, Mike Percy <mp...@cloudera.com> wrote:

> Will just contributed an Interceptor to provide this out of the box:
>
> https://issues.apache.org/jira/browse/FLUME-1284
>
> Regards
> Mike
>
>
> On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:
>
> > There is no problem from your side..Have a look at this -
> >
> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
> >
> > Regards,
> > Mohammad Tariq
> >
> >
> > On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar <bmarthi@gmail.com (mailto:
> bmarthi@gmail.com)> wrote:
> > > Unfortunately, that part is not working as expected. Must be my mistake
> > > somewhere in the configuration. Here is my sink configuration.
> > >
> > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> > > agent4.sinks.svc_0_sink.rollInterval=5400
> > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > >
> > > Any thoughts on how to define host specific directory/file?
> > >
> > > Bhaskar
> > >
> > > On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <dontariq@gmail.com(mailto:
> dontariq@gmail.com)> wrote:
> > > >
> > > > Hello Bhaskar,
> > > >
> > > > That's great...And we can use "%{host}" as the escape
> > > > sequence to prefix our filenames(am I getting you correctly???).And I
> > > > am waiting anxiously for your guide as I am still a newbie..:-)
> > > >
> > > > Regards,
> > > > Mohammad Tariq
> > > >
> > > >
> > > > On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bmarthi@gmail.com (mailto:
> bmarthi@gmail.com)> wrote:
> > > > > Thank you guys for the responses. I actually was able to get around
> > > > > this
> > > > > problem by tinkering around with my setting. I finally ended up
> with a
> > > > > capacity of 10000 and commented out transactionCapacity (i
> originally
> > > > > set it
> > > > > to 10) and it started working. Thanks for the insight. It took me a
> > > > > bit of
> > > > > time to figure out the inner workings of AVRO to get it to send
> data in
> > > > > correct format. So, i got over that hump:-). Here is my flow for
> POC.
> > > > >
> > > > > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc
> channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > > --conf
> > > > > ../conf/)
> > > > > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc
> channel
> > > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > > --conf
> > > > > ../conf/)
> > > > > Host C agent --> avro-collector source --> file sink to local
> directory
> > > > > --
> > > > > Memory channel
> > > > >
> > > > > The issue i am running into is, I am unable to uniquely identify
> the
> > > > > source
> > > > > of the log in the sink (means the log events from Host A and Host
> B are
> > > > > combined into the same log on the disk and mixed up). Is there a
> way to
> > > > > provide unique identifier from the source so that we can track the
> > > > > origin of
> > > > > the log? I am hoping to see in my sink log,
> > > > >
> > > > > Host A -- some log entry
> > > > > Host B -- Some log entry etc
> > > > >
> > > > > Is this feasible or are there any alternative mechanisms to achieve
> > > > > this? I
> > > > > am putting together a new bee guide that might help answer some of
> these
> > > > > questions for others as i explore this architecture.
> > > > >
> > > > > As always thanks for your assistance,
> > > > > Bhaskar
> > > > >
> > > > >
> > > > > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> > > > > <juhani_connolly@cyberagent.co.jp (mailto:
> juhani_connolly@cyberagent.co.jp)> wrote:
> > > > > >
> > > > > > Hello Bhaskar,
> > > > > >
> > > > > > Using Avro is generally the recommended method to handle
> multi-hop
> > > > > > flows,
> > > > > > so no concerns there.
> > > > > >
> > > > > > Have you tried this setup using memory channels instead of jdbc?
> Last
> > > > > > time
> > > > > > I tested it, the JDBC channel had poor throughput, so you may be
> > > > > > getting a
> > > > > > logjam somewhere. How much data is getting entered into your
> logfile?
> > > > > > Try
> > > > > > raising the capacity on your jdbc channel by a lot(10000?). With
> a
> > > > > > capacity
> > > > > > of 10, if the reading side(host b) isn't polling frequently
> enough,
> > > > > > there's
> > > > > > going to be problems. This is probably why you get the "failed to
> > > > > > persist
> > > > > > event". As far as FLUME-1259 is concerned, that should only be
> > > > > > happening if
> > > > > > bad data is being sent. You're not sending anything else to the
> same
> > > > > > port
> > > > > > are you? Make sure that only the source and sink are set to that
> port
> > > > > > and
> > > > > > that nothing else is.
> > > > > >
> > > > > > If the problem continues, please post a chunk of the logs
> leading up to
> > > > > > the OOM error(the full trace for the cause should be enough)
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 06/16/2012 12:01 AM, Bhaskar wrote:
> > > > > >
> > > > > > Sorry to be a pest with stream of questions. I think i am going
> two
> > > > > > steps
> > > > > > forward and four steps back:-). After my first successful
> attempt, i
> > > > > > tried
> > > > > > running flume with the following flow:
> > > > > >
> > > > > > 1. HostA
> > > > > > -- Source is tail web server log
> > > > > > -- channel jdbc
> > > > > > -- sink is AVRO collection on Host B
> > > > > > Configuraiton:
> > > > > > agent3.sources = tail
> > > > > > agent3.sinks = avro-forward-sink
> > > > > > agent3.channels = jdbc-channel
> > > > > >
> > > > > > # Define source flow
> > > > > > agent3.sources.tail.type = exec
> > > > > > agent3.sources.tail.command = tail -f /common/log/access.log
> > > > > > agent3.sources.tail.channels = jdbc-channel
> > > > > >
> > > > > > # define the flow
> > > > > > agent3.sinks.avro-forward-sink.channel = jdbc-channel
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent3.sources.avro-forward-sink.type = avro
> > > > > > agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> > > > > > agent3.sources.avro-forward-sink.port = <<PORT>>
> > > > > >
> > > > > > # Define channels
> > > > > > agent3.channels.jdbc-channel.type = jdbc
> > > > > > agent3.channels.jdbc-channel.maximum.capacity = 10
> > > > > > agent3.channels.jdbc-channel.maximum.connections = 2
> > > > > >
> > > > > >
> > > > > > 2. HostB
> > > > > > -- Source is AVRO collection
> > > > > > -- channel is memory
> > > > > > -- sink is local file system
> > > > > >
> > > > > > Configuration:
> > > > > > # list sources, sinks and channels in the agent4
> > > > > > agent4.sources = avro-collection-source
> > > > > > agent4.sinks = svc_0_sink
> > > > > > agent4.channels = MemoryChannel-2
> > > > > >
> > > > > > # define the flow
> > > > > > agent4.sources.avro-collection-source.channels = MemoryChannel-2
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > # avro sink properties
> > > > > > agent4.sources.avro-collection-source.type = avro
> > > > > > agent4.sources.avro-collection-source.bind = <<IP Address>>
> > > > > > agent4.sources.avro-collection-source.port = <<PORT>>
> > > > > >
> > > > > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> > > > > > agent4.sinks.svc_0_sink.rollInterval=600
> > > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > > >
> > > > > > agent4.channels.MemoryChannel-2.type = memory
> > > > > > agent4.channels.MemoryChannel-2.capacity = 100
> > > > > > agent4.channels.MemoryChannel-2.transactionCapacity = 10
> > > > > >
> > > > > >
> > > > > > Basically i am trying to tail a file on one host, stream it to
> another
> > > > > > host running sink. During the trial run, the configuration is
> loaded
> > > > > > fine
> > > > > > and i see the channels created fine. I see an exception from the
> jdbc
> > > > > > channel first (Failed to persist event). I am getting a java heap
> > > > > > space OOM
> > > > > > exception from Host B when Host A attempts to write.
> > > > > >
> > > > > > 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected
> exception from
> > > > > > downstream.
> > > > > > java.lang.OutOfMemoryError: Java heap space
> > > > > >
> > > > > > This issue was already
> > > > > > reported https://issues.apache.org/jira/browse/FLUME-1259 but i
> am not
> > > > > > sure
> > > > > > if there is a work around to this problem. I have couple
> questions:
> > > > > >
> > > > > > 1. Am i force fitting a wrong solution here using AVRO?
> > > > > > 2. if so, what would be a right solution for streaming data from
> Host
> > > > > > A
> > > > > > to Host B (or thru intermediaries)?
> > > > > >
> > > > > > Thanks,
> > > > > > Bhaskar
> > > > > >
> > > > > > On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <
> dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > > wrote:
> > > > > > >
> > > > > > > Since you are thinking of using multi-hop flow I would suggest
> to go
> > > > > > > for "JDBC Channel" as there is higher chance of error than
> single-hop
> > > > > > > flow and in JDBC Channel events are stored in a persistent
> storage
> > > > > > > that’s backed by a database. For detailed guidelines you can
> refer
> > > > > > > Flume 1.x User Guide at -
> > > > > > >
> > > > > > >
> > > > > > >
> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> > > > > > >
> > > > > > > Regards,
> > > > > > > Mohammad Tariq
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmarthi@gmail.com(mailto:
> bmarthi@gmail.com)> wrote:
> > > > > > > > Hi Mohammad,
> > > > > > > > Thanks for the pointer there. Do you think using a message
> queue
> > > > > > > > (like
> > > > > > > > rabbitmq) would be a choice of communication channel between
> each
> > > > > > > > hop?
> > > > > > > > i am
> > > > > > > > struggling to get a handle on how i need to configure my
> sink in
> > > > > > > > intermediary hops in a multi-hop flow. Appreciate any
> > > > > > > > guidance/examples.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bhaskar
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <
> dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hello Bhaskar,
> > > > > > > > >
> > > > > > > > > That's great..And the best approach to stream logs depends
> > > > > > > > > upon
> > > > > > > > > the type of source you want to watch for..And by looking
> at your
> > > > > > > > > usecase, I would suggest to go for "multi-hop" flows where
> events
> > > > > > > > > travel through multiple agents before reaching the final
> > > > > > > > > destination.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Mohammad Tariq
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <
> bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > > wrote:
> > > > > > > > > > I know what i am missing:-) I missed connecting the sink
> with
> > > > > > > > > > the
> > > > > > > > > > channel.
> > > > > > > > > > My small POC works now and i am able to view the
> streamed logs.
> > > > > > > > > > Thank
> > > > > > > > > > you
> > > > > > > > > > all for the guidance and patience in answering all
> questions.
> > > > > > > > > > So,
> > > > > > > > > > whats
> > > > > > > > > > the
> > > > > > > > > > best approach to stream logs from other hosts? Basically
> my next
> > > > > > > > > > task
> > > > > > > > > > would
> > > > > > > > > > be to set up collector (sort of) model to stream logs to
> > > > > > > > > > intermediary
> > > > > > > > > > and
> > > > > > > > > > then stream from collector to a sink location. I'd
> appreciate
> > > > > > > > > > any
> > > > > > > > > > thoughts/guidance in this regard.
> > > > > > > > > >
> > > > > > > > > > Bhaskar
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <
> bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > For testing purposes, I tried with the following
> configuration
> > > > > > > > > > > without
> > > > > > > > > > > much luck. I see that the process started fine but it
> just does
> > > > > > > > > > > not
> > > > > > > > > > > write
> > > > > > > > > > > anything to the sink. I guess i am missing something
> here. Can
> > > > > > > > > > > one of
> > > > > > > > > > > you
> > > > > > > > > > > gurus take a look and suggest what i am doing wrong?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Bhaskar
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources = tail
> > > > > > > > > > > agent1.channels = MemoryChannel-2
> > > > > > > > > > > agent1.sinks = svc_0_sink
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > agent1.sources.tail.type = exec
> > > > > > > > > > > agent1.sources.tail.command = tail -f
> /var/log/access.log
> > > > > > > > > > > agent1.sources.tail.channels = MemoryChannel-2
> > > > > > > > > > >
> > > > > > > > > > > agent1.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > > > > > >
> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> > > > > > > > > > > agent1.sinks.svc_0_sink.rollInterval=0
> > > > > > > > > > >
> > > > > > > > > > > agent1.channels.MemoryChannel-2.type = memory
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
> > > > > > > > > > > <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Bhaskar,
> > > > > > > > > > > >
> > > > > > > > > > > > This is the flume.conf (http://pastebin.com/WULgUuaf)
> what I'm
> > > > > > > > > > > > using.
> > > > > > > > > > > > I have an avro server on the hadoop-m host and one
> agent per
> > > > > > > > > > > > node
> > > > > > > > > > > > (slave
> > > > > > > > > > > > hosts). Each agent send the ouput of a exec command
> to avro
> > > > > > > > > > > > server.
> > > > > > > > > > > >
> > > > > > > > > > > > Host1 : exec -> memory -> avro (sink)
> > > > > > > > > > > >
> > > > > > > > > > > > Host2 : exec -> memory -> avro
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > MainHost :
> > > > > > > > > > > > avro
> > > > > > > > > > > > (source) -> memory -> rolling file (local FS)
> > > > > > > > > > > > ...
> > > > > > > > > > > >
> > > > > > > > > > > > Host3 : exec -> memory -> avro
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Use your own exec command to read Apache log.
> > > > > > > > > > > >
> > > > > > > > > > > > Guillaume Polaert | Cyrès Conseil
> > > > > > > > > > > >
> > > > > > > > > > > > De : Bhaskar [mailto:bmarthi@gmail.com]
> > > > > > > > > > > > Envoyé : mercredi 13 juin 2012 19:16
> > > > > > > > > > > > À : flume-user@incubator.apache.org (mailto:
> flume-user@incubator.apache.org)
> > > > > > > > > > > > Objet : Newbee question about flume 1.2 set up
> > > > > > > > > > > >
> > > > > > > > > > > > Good Afternoon,
> > > > > > > > > > > > I am a newbee to flume and read thru limited
> documentation
> > > > > > > > > > > > available.
> > > > > > > > > > > > I
> > > > > > > > > > > > would like to set up the following to test out.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Read apache access logs (as source)
> > > > > > > > > > > > 2. Use memory channel
> > > > > > > > > > > > 3. Write it to a NFS (or even local) file system
> > > > > > > > > > > >
> > > > > > > > > > > > Can some one help me with the necessary
> configuration. I am
> > > > > > > > > > > > having
> > > > > > > > > > > > difficult time to glean that information from
> available
> > > > > > > > > > > > documentation.
> > > > > > > > > > > > I am
> > > > > > > > > > > > sure someone has done such test before and i
> appreciate if you
> > > > > > > > > > > > can
> > > > > > > > > > > > pass on
> > > > > > > > > > > > that information. Secondly, I also would like to
> stream the
> > > > > > > > > > > > logs
> > > > > > > > > > > > to a
> > > > > > > > > > > > remote server. Is that a log4j configuration or do i
> need to
> > > > > > > > > > > > run
> > > > > > > > > > > > an
> > > > > > > > > > > > agent
> > > > > > > > > > > > on each host to do so? Any configuration examples
> would be of
> > > > > > > > > > > > great
> > > > > > > > > > > > help.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Bhaskar
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
>

Re: Newbee question about flume 1.2 set up

Posted by Mike Percy <mp...@cloudera.com>.
Will just contributed an Interceptor to provide this out of the box:  

https://issues.apache.org/jira/browse/FLUME-1284

Regards
Mike
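
If it helps, a minimal sketch of wiring that interceptor in on the sending
agent might look like the following (the interceptor name is mine, and the
property names are assumptions based on the patch, so please check FLUME-1284
for the final ones):

# on the sending agent, attach a host interceptor to the source
agent3.sources.tail.interceptors = hostint
agent3.sources.tail.interceptors.hostint.type = host
# header the host name is written into (assumed default: "host")
agent3.sources.tail.interceptors.hostint.hostHeader = host

A downstream sink can then pick the value up via the %{host} escape sequence
wherever the sink type supports it.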


On Tuesday, June 19, 2012 at 2:54 PM, Mohammad Tariq wrote:

> There is no problem from your side..Have a look at this -
> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E
>  
> Regards,
> Mohammad Tariq
>  
>  
> On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > Unfortunately, that part is not working as expected. Must be my mistake
> > somewhere in the configuration. Here is my sink configuration.
> >  
> > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> > agent4.sinks.svc_0_sink.rollInterval=5400
> > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >  
> > Any thoughts on how to define host specific directory/file?
> >  
> > Bhaskar
> >  
> > On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)> wrote:
> > >  
> > > Hello Bhaskar,
> > >  
> > > That's great...And we can use "%{host}" as the escape
> > > sequence to prefix our filenames(am I getting you correctly???).And I
> > > am waiting anxiously for your guide as I am still a newbie..:-)
> > >  
> > > Regards,
> > > Mohammad Tariq
> > >  
> > >  
> > > On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > > Thank you guys for the responses. I actually was able to get around
> > > > this
> > > > problem by tinkering around with my setting. I finally ended up with a
> > > > capacity of 10000 and commented out transactionCapacity (i originally
> > > > set it
> > > > to 10) and it started working. Thanks for the insight. It took me a
> > > > bit of
> > > > time to figure out the inner workings of AVRO to get it to send data in
> > > > correct format. So, i got over that hump:-). Here is my flow for POC.
> > > >  
> > > > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > --conf
> > > > ../conf/)
> > > > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > > > (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>>
> > > > --conf
> > > > ../conf/)
> > > > Host C agent --> avro-collector source --> file sink to local directory
> > > > --
> > > > Memory channel
> > > >  
> > > > The issue i am running into is, I am unable to uniquely identify the
> > > > source
> > > > of the log in the sink (means the log events from Host A and Host B are
> > > > combined into the same log on the disk and mixed up). Is there a way to
> > > > provide unique identifier from the source so that we can track the
> > > > origin of
> > > > the log? I am hoping to see in my sink log,
> > > >  
> > > > Host A -- some log entry
> > > > Host B -- Some log entry etc
> > > >  
> > > > Is this feasible or are there any alternative mechanisms to achieve
> > > > this? I
> > > > am putting together a new bee guide that might help answer some of these
> > > > questions for others as i explore this architecture.
> > > >  
> > > > As always thanks for your assistance,
> > > > Bhaskar
> > > >  
> > > >  
> > > > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> > > > <juhani_connolly@cyberagent.co.jp (mailto:juhani_connolly@cyberagent.co.jp)> wrote:
> > > > >  
> > > > > Hello Bhaskar,
> > > > >  
> > > > > Using Avro is generally the recommended method to handle multi-hop
> > > > > flows,
> > > > > so no concerns there.
> > > > >  
> > > > > Have you tried this setup using memory channels instead of jdbc? Last
> > > > > time
> > > > > I tested it, the JDBC channel had poor throughput, so you may be
> > > > > getting a
> > > > > logjam somewhere. How much data is getting entered into your logfile?
> > > > > Try
> > > > > raising the capacity on your jdbc channel by a lot(10000?). With a
> > > > > capacity
> > > > > of 10, if the reading side(host b) isn't polling frequently enough,
> > > > > there's
> > > > > going to be problems. This is probably why you get the "failed to
> > > > > persist
> > > > > event". As far as FLUME-1259 is concerned, that should only be
> > > > > happening if
> > > > > bad data is being sent. You're not sending anything else to the same
> > > > > port
> > > > > are you? Make sure that only the source and sink are set to that port
> > > > > and
> > > > > that nothing else is.
> > > > >  
> > > > > If the problem continues, please post a chunk of the logs leading up to
> > > > > the OOM error(the full trace for the cause should be enough)
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > > > On 06/16/2012 12:01 AM, Bhaskar wrote:
> > > > >  
> > > > > Sorry to be a pest with stream of questions. I think i am going two
> > > > > steps
> > > > > forward and four steps back:-). After my first successful attempt, i
> > > > > tried
> > > > > running flume with the following flow:
> > > > >  
> > > > > 1. HostA
> > > > > -- Source is tail web server log
> > > > > -- channel jdbc
> > > > > -- sink is AVRO collection on Host B
> > > > > Configuraiton:
> > > > > agent3.sources = tail
> > > > > agent3.sinks = avro-forward-sink
> > > > > agent3.channels = jdbc-channel
> > > > >  
> > > > > # Define source flow
> > > > > agent3.sources.tail.type = exec
> > > > > agent3.sources.tail.command = tail -f /common/log/access.log
> > > > > agent3.sources.tail.channels = jdbc-channel
> > > > >  
> > > > > # define the flow
> > > > > agent3.sinks.avro-forward-sink.channel = jdbc-channel
> > > > >  
> > > > > # avro sink properties
> > > > > agent3.sources.avro-forward-sink.type = avro
> > > > > agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> > > > > agent3.sources.avro-forward-sink.port = <<PORT>>
> > > > >  
> > > > > # Define channels
> > > > > agent3.channels.jdbc-channel.type = jdbc
> > > > > agent3.channels.jdbc-channel.maximum.capacity = 10
> > > > > agent3.channels.jdbc-channel.maximum.connections = 2
> > > > >  
> > > > >  
> > > > > 2. HostB
> > > > > -- Source is AVRO collection
> > > > > -- channel is memory
> > > > > -- sink is local file system
> > > > >  
> > > > > Configuration:
> > > > > # list sources, sinks and channels in the agent4
> > > > > agent4.sources = avro-collection-source
> > > > > agent4.sinks = svc_0_sink
> > > > > agent4.channels = MemoryChannel-2
> > > > >  
> > > > > # define the flow
> > > > > agent4.sources.avro-collection-source.channels = MemoryChannel-2
> > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > >  
> > > > > # avro sink properties
> > > > > agent4.sources.avro-collection-source.type = avro
> > > > > agent4.sources.avro-collection-source.bind = <<IP Address>>
> > > > > agent4.sources.avro-collection-source.port = <<PORT>>
> > > > >  
> > > > > agent4.sinks.svc_0_sink.type = FILE_ROLL
> > > > > agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> > > > > agent4.sinks.svc_0_sink.rollInterval=600
> > > > > agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> > > > >  
> > > > > agent4.channels.MemoryChannel-2.type = memory
> > > > > agent4.channels.MemoryChannel-2.capacity = 100
> > > > > agent4.channels.MemoryChannel-2.transactionCapacity = 10
> > > > >  
> > > > >  
> > > > > Basically i am trying to tail a file on one host, stream it to another
> > > > > host running sink. During the trial run, the configuration is loaded
> > > > > fine
> > > > > and i see the channels created fine. I see an exception from the jdbc
> > > > > channel first (Failed to persist event). I am getting a java heap
> > > > > space OOM
> > > > > exception from Host B when Host A attempts to write.
> > > > >  
> > > > > 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
> > > > > downstream.
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > > >  
> > > > > This issue was already
> > > > > reported https://issues.apache.org/jira/browse/FLUME-1259 but i am not
> > > > > sure
> > > > > if there is a work around to this problem. I have couple questions:
> > > > >  
> > > > > 1. Am i force fitting a wrong solution here using AVRO?
> > > > > 2. if so, what would be a right solution for streaming data from Host
> > > > > A
> > > > > to Host B (or thru intermediaries)?
> > > > >  
> > > > > Thanks,
> > > > > Bhaskar
> > > > >  
> > > > > On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > wrote:
> > > > > >  
> > > > > > Since you are thinking of using multi-hop flow I would suggest to go
> > > > > > for "JDBC Channel" as there is higher chance of error than single-hop
> > > > > > flow and in JDBC Channel events are stored in a persistent storage
> > > > > > that’s backed by a database. For detailed guidelines you can refer
> > > > > > Flume 1.x User Guide at -
> > > > > >  
> > > > > >  
> > > > > > https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> > > > > >  
> > > > > > Regards,
> > > > > > Mohammad Tariq
> > > > > >  
> > > > > >  
> > > > > > On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > > > > > Hi Mohammad,
> > > > > > > Thanks for the pointer there. Do you think using a message queue
> > > > > > > (like
> > > > > > > rabbitmq) would be a choice of communication channel between each
> > > > > > > hop?
> > > > > > > i am
> > > > > > > struggling to get a handle on how i need to configure my sink in
> > > > > > > intermediary hops in a multi-hop flow. Appreciate any
> > > > > > > guidance/examples.
> > > > > > >  
> > > > > > > Thanks,
> > > > > > > Bhaskar
> > > > > > >  
> > > > > > >  
> > > > > > > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)>
> > > > > > > wrote:
> > > > > > > >  
> > > > > > > > Hello Bhaskar,
> > > > > > > >  
> > > > > > > > That's great..And the best approach to stream logs depends
> > > > > > > > upon
> > > > > > > > the type of source you want to watch for..And by looking at your
> > > > > > > > usecase, I would suggest to go for "multi-hop" flows where events
> > > > > > > > travel through multiple agents before reaching the final
> > > > > > > > destination.
> > > > > > > >  
> > > > > > > > Regards,
> > > > > > > > Mohammad Tariq
> > > > > > > >  
> > > > > > > >  
> > > > > > > > On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > wrote:
> > > > > > > > > I know what i am missing:-) I missed connecting the sink with
> > > > > > > > > the
> > > > > > > > > channel.
> > > > > > > > > My small POC works now and i am able to view the streamed logs.
> > > > > > > > > Thank
> > > > > > > > > you
> > > > > > > > > all for the guidance and patience in answering all questions.
> > > > > > > > > So,
> > > > > > > > > whats
> > > > > > > > > the
> > > > > > > > > best approach to stream logs from other hosts? Basically my next
> > > > > > > > > task
> > > > > > > > > would
> > > > > > > > > be to set up collector (sort of) model to stream logs to
> > > > > > > > > intermediary
> > > > > > > > > and
> > > > > > > > > then stream from collector to a sink location. I'd appreciate
> > > > > > > > > any
> > > > > > > > > thoughts/guidance in this regard.
> > > > > > > > >  
> > > > > > > > > Bhaskar
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > > > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)>
> > > > > > > > > wrote:
> > > > > > > > > >  
> > > > > > > > > > For testing purposes, I tried with the following configuration
> > > > > > > > > > without
> > > > > > > > > > much luck. I see that the process started fine but it just does
> > > > > > > > > > not
> > > > > > > > > > write
> > > > > > > > > > anything to the sink. I guess i am missing something here. Can
> > > > > > > > > > one of
> > > > > > > > > > you
> > > > > > > > > > gurus take a look and suggest what i am doing wrong?
> > > > > > > > > >  
> > > > > > > > > > Thanks,
> > > > > > > > > > Bhaskar
> > > > > > > > > >  
> > > > > > > > > > agent1.sources = tail
> > > > > > > > > > agent1.channels = MemoryChannel-2
> > > > > > > > > > agent1.sinks = svc_0_sink
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > > agent1.sources.tail.type = exec
> > > > > > > > > > agent1.sources.tail.command = tail -f /var/log/access.log
> > > > > > > > > > agent1.sources.tail.channels = MemoryChannel-2
> > > > > > > > > >  
> > > > > > > > > > agent1.sinks.svc_0_sink.type = FILE_ROLL
> > > > > > > > > > agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> > > > > > > > > > agent1.sinks.svc_0_sink.rollInterval=0
> > > > > > > > > >  
> > > > > > > > > > agent1.channels.MemoryChannel-2.type = memory
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > > On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
> > > > > > > > > > <gpolaert@cyres.fr (mailto:gpolaert@cyres.fr)>
> > > > > > > > > > wrote:
> > > > > > > > > > >  
> > > > > > > > > > > Hi Bhaskar,
> > > > > > > > > > >  
> > > > > > > > > > > This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
> > > > > > > > > > > using.
> > > > > > > > > > > I have an avro server on the hadoop-m host and one agent per
> > > > > > > > > > > node
> > > > > > > > > > > (slave
> > > > > > > > > > > hosts). Each agent send the ouput of a exec command to avro
> > > > > > > > > > > server.
> > > > > > > > > > >  
> > > > > > > > > > > Host1 : exec -> memory -> avro (sink)
> > > > > > > > > > >  
> > > > > > > > > > > Host2 : exec -> memory -> avro
> > > > > > > > > > > >>>>>
> > > > > > > > > > > MainHost :
> > > > > > > > > > > avro
> > > > > > > > > > > (source) -> memory -> rolling file (local FS)
> > > > > > > > > > > ...
> > > > > > > > > > >  
> > > > > > > > > > > Host3 : exec -> memory -> avro
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > > Use your own exec command to read Apache log.
> > > > > > > > > > >  
> > > > > > > > > > > Guillaume Polaert | Cyrès Conseil
> > > > > > > > > > >  
> > > > > > > > > > > De : Bhaskar [mailto:bmarthi@gmail.com]
> > > > > > > > > > > Envoyé : mercredi 13 juin 2012 19:16
> > > > > > > > > > > À : flume-user@incubator.apache.org (mailto:flume-user@incubator.apache.org)
> > > > > > > > > > > Objet : Newbee question about flume 1.2 set up
> > > > > > > > > > >  
> > > > > > > > > > > Good Afternoon,
> > > > > > > > > > > I am a newbee to flume and read thru limited documentation
> > > > > > > > > > > available.
> > > > > > > > > > > I
> > > > > > > > > > > would like to set up the following to test out.
> > > > > > > > > > >  
> > > > > > > > > > > 1. Read apache access logs (as source)
> > > > > > > > > > > 2. Use memory channel
> > > > > > > > > > > 3. Write it to a NFS (or even local) file system
> > > > > > > > > > >  
> > > > > > > > > > > Can some one help me with the necessary configuration. I am
> > > > > > > > > > > having
> > > > > > > > > > > difficult time to glean that information from available
> > > > > > > > > > > documentation.
> > > > > > > > > > > I am
> > > > > > > > > > > sure someone has done such test before and i appreciate if you
> > > > > > > > > > > can
> > > > > > > > > > > pass on
> > > > > > > > > > > that information. Secondly, I also would like to stream the
> > > > > > > > > > > logs
> > > > > > > > > > > to a
> > > > > > > > > > > remote server. Is that a log4j configuration or do i need to
> > > > > > > > > > > run
> > > > > > > > > > > an
> > > > > > > > > > > agent
> > > > > > > > > > > on each host to do so? Any configuration examples would be of
> > > > > > > > > > > great
> > > > > > > > > > > help.
> > > > > > > > > > >  
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Bhaskar
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> >  
>  




Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
There is no problem from your side. Have a look at this -
http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201206.mbox/%3CCAGPLoJKLthyoecEYnJRscahe8q6i4kKH1ADsL3qoQCAQo=ig9g@mail.gmail.com%3E

Regards,
    Mohammad Tariq


On Wed, Jun 20, 2012 at 1:42 AM, Bhaskar <bm...@gmail.com> wrote:
> Unfortunately, that part is not working as expected.  Must be my mistake
> somewhere in the configuration.  Here is my sink configuration.
>
> agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
> agent4.sinks.svc_0_sink.rollInterval=5400
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> Any thoughts on how to define host specific directory/file?
>
> Bhaskar
>
> On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> Hello Bhaskar,
>>
>>         That's great...And we can use "%{host}" as the escape
>> sequence to prefix our filenames(am I getting you correctly???).And I
>> am waiting anxiously for your guide as I am still a newbie..:-)
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bm...@gmail.com> wrote:
>> > Thank you guys for the responses.  I actually was able to get around
>> > this
>> > problem by tinkering around with my setting.  I finally ended up with a
>> > capacity of 10000 and commented out transactionCapacity (i originally
>> > set it
>> > to 10) and it started working.  Thanks for the insight.  It took me a
>> > bit of
>> > time to figure out the inner workings of AVRO to get it to send data in
>> > correct format.  So, i got over that hump:-).  Here is my flow for POC.
>> >
>> > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
>> > (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>>
>> > --conf
>> > ../conf/)
>> > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
>> > (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>>
>> > --conf
>> > ../conf/)
>> > Host C agent --> avro-collector source --> file sink to local directory
>> > --
>> > Memory channel
>> >
>> > The issue i am running into is, I am unable to uniquely identify the
>> > source
>> > of the log in the sink (means the log events from Host A and Host B are
>> > combined into the same log on the disk and mixed up).  Is there a way to
>> > provide unique identifier from the source so that we can track the
>> > origin of
>> > the log?  I am hoping to see in my sink log,
>> >
>> > Host A -- some log entry
>> > Host B -- Some log entry  etc
>> >
>> > Is this feasible or are there any alternative mechanisms to achieve
>> > this?  I
>> > am putting together a new bee guide that might help answer some of these
>> > questions for others as i explore this architecture.
>> >
>> > As always thanks for your assistance,
>> > Bhaskar
>> >
>> >
>> > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
>> > <ju...@cyberagent.co.jp> wrote:
>> >>
>> >> Hello Bhaskar,
>> >>
>> >> Using Avro is generally the recommended method to handle multi-hop
>> >> flows,
>> >> so no concerns there.
>> >>
>> >> Have you tried this setup using memory channels instead of jdbc? Last
>> >> time
>> >> I tested it, the JDBC channel had poor throughput, so you may be
>> >> getting a
>> >> logjam somewhere. How much data is getting entered into your logfile?
>> >> Try
>> >> raising the capacity on your jdbc channel by a lot(10000?). With a
>> >> capacity
>> >> of 10, if the reading side(host b) isn't polling frequently enough,
>> >> there's
>> >> going to be problems. This is probably why you get the "failed to
>> >> persist
>> >> event". As far as FLUME-1259 is concerned, that should only be
>> >> happening if
>> >> bad data is being sent. You're not sending anything else to the same
>> >> port
>> >> are you? Make sure that only the source and sink are set to that port
>> >> and
>> >> that nothing else is.
>> >>
>> >> If the problem continues, please post a chunk of the logs leading up to
>> >> the OOM error(the full trace for the cause should be enough)
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 06/16/2012 12:01 AM, Bhaskar wrote:
>> >>
>> >> Sorry to be a pest with stream of questions.  I think i am going two
>> >> steps
>> >> forward and four steps back:-).  After my first successful attempt, i
>> >> tried
>> >> running flume with the following flow:
>> >>
>> >> 1.  HostA
>> >>   -- Source  is tail web server log
>> >>   -- channel jdbc
>> >>   -- sink is AVRO collection on Host B
>> >> Configuraiton:
>> >> agent3.sources = tail
>> >> agent3.sinks = avro-forward-sink
>> >> agent3.channels = jdbc-channel
>> >>
>> >> # Define source flow
>> >> agent3.sources.tail.type = exec
>> >> agent3.sources.tail.command = tail -f /common/log/access.log
>> >> agent3.sources.tail.channels = jdbc-channel
>> >>
>> >> # define the flow
>> >> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>> >>
>> >> # avro sink properties
>> >> agent3.sources.avro-forward-sink.type = avro
>> >> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
>> >> agent3.sources.avro-forward-sink.port = <<PORT>>
>> >>
>> >> # Define channels
>> >> agent3.channels.jdbc-channel.type = jdbc
>> >> agent3.channels.jdbc-channel.maximum.capacity = 10
>> >> agent3.channels.jdbc-channel.maximum.connections = 2
>> >>
>> >>
>> >> 2.  HostB
>> >>   -- Source is AVRO collection
>> >>   -- channel is memory
>> >>   -- sink is local file system
>> >>
>> >> Configuration:
>> >> # list sources, sinks and channels in the agent4
>> >> agent4.sources = avro-collection-source
>> >> agent4.sinks = svc_0_sink
>> >> agent4.channels = MemoryChannel-2
>> >>
>> >> # define the flow
>> >> agent4.sources.avro-collection-source.channels = MemoryChannel-2
>> >> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>> >>
>> >> # avro sink properties
>> >> agent4.sources.avro-collection-source.type = avro
>> >> agent4.sources.avro-collection-source.bind = <<IP Address>>
>> >> agent4.sources.avro-collection-source.port = <<PORT>>
>> >>
>> >> agent4.sinks.svc_0_sink.type = FILE_ROLL
>> >> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
>> >> agent4.sinks.svc_0_sink.rollInterval=600
>> >> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>> >>
>> >> agent4.channels.MemoryChannel-2.type = memory
>> >> agent4.channels.MemoryChannel-2.capacity = 100
>> >> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>> >>
>> >>
>> >> Basically i am trying to tail a file on one host, stream it to another
>> >> host running sink.  During the trial run, the configuration is loaded
>> >> fine
>> >> and i see the channels created fine.  I see an exception from the jdbc
>> >> channel first (Failed to persist event).  I am getting a java heap
>> >> space OOM
>> >> exception from Host B when Host A attempts to write.
>> >>
>> >> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
>> >> downstream.
>> >> java.lang.OutOfMemoryError: Java heap space
>> >>
>> >> This issue was already
>> >> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am not
>> >> sure
>> >> if there is a work around to this problem.  I have couple questions:
>> >>
>> >> 1.  Am i force fitting a wrong solution here using AVRO?
>> >> 2.  if so, what would be a right solution for streaming data from Host
>> >> A
>> >> to Host B (or thru intermediaries)?
>> >>
>> >> Thanks,
>> >> Bhaskar
>> >>
>> >> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Since you are thinking of using multi-hop flow I would suggest to go
>> >>> for "JDBC Channel" as there is higher chance of error than single-hop
>> >>> flow and in JDBC Channel events are stored in a persistent storage
>> >>> that’s backed by a database. For detailed guidelines you can refer
>> >>> Flume 1.x User Guide at -
>> >>>
>> >>>
>> >>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>> >>>
>> >>> Regards,
>> >>>     Mohammad Tariq
>> >>>
>> >>>
>> >>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
>> >>> > Hi Mohammad,
>> >>> > Thanks for the pointer there.  Do you think using a message queue
>> >>> > (like
>> >>> > rabbitmq) would be a choice of communication channel between each
>> >>> > hop?
>> >>> >  i am
>> >>> > struggling to get a handle on how i need to configure my sink in
>> >>> > intermediary hops in a multi-hop flow.    Appreciate any
>> >>> > guidance/examples.
>> >>> >
>> >>> > Thanks,
>> >>> > Bhaskar
>> >>> >
>> >>> >
>> >>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hello Bhaskar,
>> >>> >>
>> >>> >>      That's great..And the best approach to stream logs depends
>> >>> >> upon
>> >>> >> the type of source you want to watch for..And by looking at your
>> >>> >> usecase, I would suggest to go for "multi-hop" flows where events
>> >>> >> travel through multiple agents before reaching the final
>> >>> >> destination.
>> >>> >>
>> >>> >> Regards,
>> >>> >>     Mohammad Tariq
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com>
>> >>> >> wrote:
>> >>> >> > I know what i am missing:-)  I missed connecting the sink with
>> >>> >> > the
>> >>> >> > channel.
>> >>> >> >  My small POC works now and i am able to view the streamed logs.
>> >>> >> >  Thank
>> >>> >> > you
>> >>> >> > all for the guidance and patience in answering all questions.
>> >>> >> >  So,
>> >>> >> > whats
>> >>> >> > the
>> >>> >> > best approach to stream logs from other hosts?  Basically my next
>> >>> >> > task
>> >>> >> > would
>> >>> >> > be to set up collector (sort of) model to stream logs to
>> >>> >> > intermediary
>> >>> >> > and
>> >>> >> > then stream from collector to a sink location.  I'd appreciate
>> >>> >> > any
>> >>> >> > thoughts/guidance in this regard.
>> >>> >> >
>> >>> >> > Bhaskar
>> >>> >> >
>> >>> >> >
>> >>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> For testing purposes, I tried with the following configuration
>> >>> >> >> without
>> >>> >> >> much luck.  I see that the process started fine but it just does
>> >>> >> >> not
>> >>> >> >> write
>> >>> >> >> anything to the sink.  I guess i am missing something here.  Can
>> >>> >> >> one of
>> >>> >> >> you
>> >>> >> >> gurus take a look and suggest what i am doing wrong?
>> >>> >> >>
>> >>> >> >> Thanks,
>> >>> >> >> Bhaskar
>> >>> >> >>
>> >>> >> >> agent1.sources = tail
>> >>> >> >> agent1.channels = MemoryChannel-2
>> >>> >> >> agent1.sinks = svc_0_sink
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> agent1.sources.tail.type = exec
>> >>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>> >>> >> >>
>> >>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >>> >> >>
>> >>> >> >> agent1.channels.MemoryChannel-2.type = memory
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>> >>> >> >> <gp...@cyres.fr>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi Bhaskar,
>> >>> >> >>>
>> >>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>> >>> >> >>> using.
>> >>> >> >>> I have an avro server on the hadoop-m host and one agent per
>> >>> >> >>> node
>> >>> >> >>> (slave
>> >>> >> >>> hosts). Each agent send the ouput of a exec command to avro
>> >>> >> >>> server.
>> >>> >> >>>
>> >>> >> >>> Host1 : exec -> memory -> avro (sink)
>> >>> >> >>>
>> >>> >> >>> Host2 : exec -> memory -> avro
>> >>> >> >>>                                                >>>>>
>> >>> >> >>>  MainHost :
>> >>> >> >>> avro
>> >>> >> >>> (source) -> memory -> rolling file (local FS)
>> >>> >> >>> ...
>> >>> >> >>>
>> >>> >> >>> Host3 : exec -> memory -> avro
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> Use your own exec command to read Apache log.
>> >>> >> >>>
>> >>> >> >>> Guillaume Polaert | Cyrès Conseil
>> >>> >> >>>
>> >>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> >>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
>> >>> >> >>> À : flume-user@incubator.apache.org
>> >>> >> >>> Objet : Newbee question about flume 1.2 set up
>> >>> >> >>>
>> >>> >> >>> Good Afternoon,
>> >>> >> >>> I am a newbee to flume and read thru limited documentation
>> >>> >> >>> available.
>> >>> >> >>>  I
>> >>> >> >>> would like to set up the following to test out.
>> >>> >> >>>
>> >>> >> >>> 1.  Read apache access logs (as source)
>> >>> >> >>> 2.  Use memory channel
>> >>> >> >>> 3.  Write it to a NFS (or even local) file system
>> >>> >> >>>
>> >>> >> >>> Can some one help me with the necessary configuration.  I am
>> >>> >> >>> having
>> >>> >> >>> difficult time to glean that information from available
>> >>> >> >>> documentation.
>> >>> >> >>>  I am
>> >>> >> >>> sure someone has done such test before and i appreciate if you
>> >>> >> >>> can
>> >>> >> >>> pass on
>> >>> >> >>> that information.  Secondly, I also would like to stream the
>> >>> >> >>> logs
>> >>> >> >>> to a
>> >>> >> >>> remote server.  Is that a log4j configuration or do i need to
>> >>> >> >>> run
>> >>> >> >>> an
>> >>> >> >>> agent
>> >>> >> >>> on each host to do so?  Any configuration examples would be of
>> >>> >> >>> great
>> >>> >> >>> help.
>> >>> >> >>>
>> >>> >> >>> Thanks,
>> >>> >> >>> Bhaskar
>> >>> >> >>
>> >>> >> >>
>> >>> >> >
>> >>> >
>> >>> >
>> >>
>> >>
>> >>
>> >
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Unfortunately, that part is not working as expected.  Must be my mistake
somewhere in the configuration.  Here is my sink configuration.

agent4.sinks.svc_0_sink.type = FILE_ROLL
agent4.sinks.svc_0_sink.sink.directory=/var/logs/agent4/%{host}
agent4.sinks.svc_0_sink.rollInterval=5400
agent4.sinks.svc_0_sink.channel = MemoryChannel-2

Any thoughts on how to define a host-specific directory/file?

Bhaskar
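
One possible workaround, assuming the upstream agents stamp a "host" header on
each event (for example via the FLUME-1284 interceptor): since FILE_ROLL takes
sink.directory literally, you could fan events out with a multiplexing channel
selector on the collector and give each host its own channel and file_roll
sink. The host and channel names below are made up for illustration:

agent4.sources.avro-collection-source.selector.type = multiplexing
agent4.sources.avro-collection-source.selector.header = host
agent4.sources.avro-collection-source.selector.mapping.hostA = MemoryChannel-A
agent4.sources.avro-collection-source.selector.mapping.hostB = MemoryChannel-B
agent4.sources.avro-collection-source.selector.default = MemoryChannel-2

agent4.sinks.hostA_sink.type = FILE_ROLL
agent4.sinks.hostA_sink.sink.directory = /var/logs/agent4/hostA
agent4.sinks.hostA_sink.rollInterval = 5400
agent4.sinks.hostA_sink.channel = MemoryChannel-A

Alternatively, an HDFS sink can expand %{host} directly in hdfs.path, if
writing to HDFS is an option for you.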

On Tue, Jun 19, 2012 at 10:01 AM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Bhaskar,
>
>         That's great...And we can use "%{host}" as the escape
> sequence to prefix our filenames(am I getting you correctly???).And I
> am waiting anxiously for your guide as I am still a newbie..:-)
>
> Regards,
>     Mohammad Tariq
>
>
> On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bm...@gmail.com> wrote:
> > Thank you guys for the responses.  I actually was able to get around this
> > problem by tinkering around with my setting.  I finally ended up with a
> > capacity of 10000 and commented out transactionCapacity (i originally
> set it
> > to 10) and it started working.  Thanks for the insight.  It took me a
> bit of
> > time to figure out the inner workings of AVRO to get it to send data in
> > correct format.  So, i got over that hump:-).  Here is my flow for POC.
> >
> > Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
> > ../conf/)
> > Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> > (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
> > ../conf/)
> > Host C agent --> avro-collector source --> file sink to local directory
> --
> > Memory channel
> >
> > The issue i am running into is, I am unable to uniquely identify the
> source
> > of the log in the sink (means the log events from Host A and Host B are
> > combined into the same log on the disk and mixed up).  Is there a way to
> > provide unique identifier from the source so that we can track the
> origin of
> > the log?  I am hoping to see in my sink log,
> >
> > Host A -- some log entry
> > Host B -- Some log entry  etc
> >
> > Is this feasible or are there any alternative mechanisms to achieve
> this?  I
> > am putting together a new bee guide that might help answer some of these
> > questions for others as i explore this architecture.
> >
> > As always thanks for your assistance,
> > Bhaskar
> >
> >
> > On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> > <ju...@cyberagent.co.jp> wrote:
> >>
> >> Hello Bhaskar,
> >>
> >> Using Avro is generally the recommended method to handle multi-hop
> flows,
> >> so no concerns there.
> >>
> >> Have you tried this setup using memory channels instead of jdbc? Last
> time
> >> I tested it, the JDBC channel had poor throughput, so you may be
> getting a
> >> logjam somewhere. How much data is getting entered into your logfile?
> Try
> >> raising the capacity on your jdbc channel by a lot(10000?). With a
> capacity
> >> of 10, if the reading side(host b) isn't polling frequently enough,
> there's
> >> going to be problems. This is probably why you get the "failed to
> persist
> >> event". As far as FLUME-1259 is concerned, that should only be
> happening if
> >> bad data is being sent. You're not sending anything else to the same
> port
> >> are you? Make sure that only the source and sink are set to that port
> and
> >> that nothing else is.
> >>
> >> If the problem continues, please post a chunk of the logs leading up to
> >> the OOM error(the full trace for the cause should be enough)
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 06/16/2012 12:01 AM, Bhaskar wrote:
> >>
> >> Sorry to be a pest with stream of questions.  I think i am going two
> steps
> >> forward and four steps back:-).  After my first successful attempt, i
> tried
> >> running flume with the following flow:
> >>
> >> 1.  HostA
> >>   -- Source  is tail web server log
> >>   -- channel jdbc
> >>   -- sink is AVRO collection on Host B
> >> Configuraiton:
> >> agent3.sources = tail
> >> agent3.sinks = avro-forward-sink
> >> agent3.channels = jdbc-channel
> >>
> >> # Define source flow
> >> agent3.sources.tail.type = exec
> >> agent3.sources.tail.command = tail -f /common/log/access.log
> >> agent3.sources.tail.channels = jdbc-channel
> >>
> >> # define the flow
> >> agent3.sinks.avro-forward-sink.channel = jdbc-channel
> >>
> >> # avro sink properties
> >> agent3.sources.avro-forward-sink.type = avro
> >> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> >> agent3.sources.avro-forward-sink.port = <<PORT>>
> >>
> >> # Define channels
> >> agent3.channels.jdbc-channel.type = jdbc
> >> agent3.channels.jdbc-channel.maximum.capacity = 10
> >> agent3.channels.jdbc-channel.maximum.connections = 2
> >>
> >>
> >> 2.  HostB
> >>   -- Source is AVRO collection
> >>   -- channel is memory
> >>   -- sink is local file system
> >>
> >> Configuration:
> >> # list sources, sinks and channels in the agent4
> >> agent4.sources = avro-collection-source
> >> agent4.sinks = svc_0_sink
> >> agent4.channels = MemoryChannel-2
> >>
> >> # define the flow
> >> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> >> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >>
> >> # avro sink properties
> >> agent4.sources.avro-collection-source.type = avro
> >> agent4.sources.avro-collection-source.bind = <<IP Address>>
> >> agent4.sources.avro-collection-source.port = <<PORT>>
> >>
> >> agent4.sinks.svc_0_sink.type = FILE_ROLL
> >> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> >> agent4.sinks.svc_0_sink.rollInterval=600
> >> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
> >>
> >> agent4.channels.MemoryChannel-2.type = memory
> >> agent4.channels.MemoryChannel-2.capacity = 100
> >> agent4.channels.MemoryChannel-2.transactionCapacity = 10
> >>
> >>
> >> Basically i am trying to tail a file on one host, stream it to another
> >> host running sink.  During the trial run, the configuration is loaded
> fine
> >> and i see the channels created fine.  I see an exception from the jdbc
> >> channel first (Failed to persist event).  I am getting a java heap
> space OOM
> >> exception from Host B when Host A attempts to write.
> >>
> >> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
> >> downstream.
> >> java.lang.OutOfMemoryError: Java heap space
> >>
> >> This issue was already
> >> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am not
> sure
> >> if there is a work around to this problem.  I have couple questions:
> >>
> >> 1.  Am i force fitting a wrong solution here using AVRO?
> >> 2.  if so, what would be a right solution for streaming data from Host A
> >> to Host B (or thru intermediaries)?
> >>
> >> Thanks,
> >> Bhaskar
> >>
> >> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>
> >> wrote:
> >>>
> >>> Since you are thinking of using multi-hop flow I would suggest to go
> >>> for "JDBC Channel" as there is higher chance of error than single-hop
> >>> flow and in JDBC Channel events are stored in a persistent storage
> >>> that’s backed by a database. For detailed guidelines you can refer
> >>> Flume 1.x User Guide at -
> >>>
> >>>
> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
> >>>
> >>> Regards,
> >>>     Mohammad Tariq
> >>>
> >>>
> >>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
> >>> > Hi Mohammad,
> >>> > Thanks for the pointer there.  Do you think using a message queue
> (like
> >>> > rabbitmq) would be a choice of communication channel between each
> hop?
> >>> >  i am
> >>> > struggling to get a handle on how i need to configure my sink in
> >>> > intermediary hops in a multi-hop flow.    Appreciate any
> >>> > guidance/examples.
> >>> >
> >>> > Thanks,
> >>> > Bhaskar
> >>> >
> >>> >
> >>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> Hello Bhaskar,
> >>> >>
> >>> >>      That's great..And the best approach to stream logs depends upon
> >>> >> the type of source you want to watch for..And by looking at your
> >>> >> usecase, I would suggest to go for "multi-hop" flows where events
> >>> >> travel through multiple agents before reaching the final
> destination.
> >>> >>
> >>> >> Regards,
> >>> >>     Mohammad Tariq
> >>> >>
> >>> >>
> >>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com>
> wrote:
> >>> >> > I know what i am missing:-)  I missed connecting the sink with the
> >>> >> > channel.
> >>> >> >  My small POC works now and i am able to view the streamed logs.
> >>> >> >  Thank
> >>> >> > you
> >>> >> > all for the guidance and patience in answering all questions.  So,
> >>> >> > whats
> >>> >> > the
> >>> >> > best approach to stream logs from other hosts?  Basically my next
> >>> >> > task
> >>> >> > would
> >>> >> > be to set up collector (sort of) model to stream logs to
> >>> >> > intermediary
> >>> >> > and
> >>> >> > then stream from collector to a sink location.  I'd appreciate any
> >>> >> > thoughts/guidance in this regard.
> >>> >> >
> >>> >> > Bhaskar
> >>> >> >
> >>> >> >
> >>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com>
> wrote:
> >>> >> >>
> >>> >> >> For testing purposes, I tried with the following configuration
> >>> >> >> without
> >>> >> >> much luck.  I see that the process started fine but it just does
> >>> >> >> not
> >>> >> >> write
> >>> >> >> anything to the sink.  I guess i am missing something here.  Can
> >>> >> >> one of
> >>> >> >> you
> >>> >> >> gurus take a look and suggest what i am doing wrong?
> >>> >> >>
> >>> >> >> Thanks,
> >>> >> >> Bhaskar
> >>> >> >>
> >>> >> >> agent1.sources = tail
> >>> >> >> agent1.channels = MemoryChannel-2
> >>> >> >> agent1.sinks = svc_0_sink
> >>> >> >>
> >>> >> >>
> >>> >> >> agent1.sources.tail.type = exec
> >>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
> >>> >> >> agent1.sources.tail.channels = MemoryChannel-2
> >>> >> >>
> >>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
> >>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> >>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
> >>> >> >>
> >>> >> >> agent1.channels.MemoryChannel-2.type = memory
> >>> >> >>
> >>> >> >>
> >>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
> >>> >> >> <gp...@cyres.fr>
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> Hi Bhaskar,
> >>> >> >>>
> >>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
> >>> >> >>> using.
> >>> >> >>> I have an avro server on the hadoop-m host and one agent per
> node
> >>> >> >>> (slave
> >>> >> >>> hosts). Each agent send the ouput of a exec command to avro
> >>> >> >>> server.
> >>> >> >>>
> >>> >> >>> Host1 : exec -> memory -> avro (sink)
> >>> >> >>>
> >>> >> >>> Host2 : exec -> memory -> avro
> >>> >> >>>                                                >>>>>
>  MainHost :
> >>> >> >>> avro
> >>> >> >>> (source) -> memory -> rolling file (local FS)
> >>> >> >>> ...
> >>> >> >>>
> >>> >> >>> Host3 : exec -> memory -> avro
> >>> >> >>>
> >>> >> >>>
> >>> >> >>> Use your own exec command to read Apache log.
> >>> >> >>>
> >>> >> >>> Guillaume Polaert | Cyrès Conseil
> >>> >> >>>
> >>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
> >>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
> >>> >> >>> À : flume-user@incubator.apache.org
> >>> >> >>> Objet : Newbee question about flume 1.2 set up
> >>> >> >>>
> >>> >> >>> Good Afternoon,
> >>> >> >>> I am a newbee to flume and read thru limited documentation
> >>> >> >>> available.
> >>> >> >>>  I
> >>> >> >>> would like to set up the following to test out.
> >>> >> >>>
> >>> >> >>> 1.  Read apache access logs (as source)
> >>> >> >>> 2.  Use memory channel
> >>> >> >>> 3.  Write it to a NFS (or even local) file system
> >>> >> >>>
> >>> >> >>> Can some one help me with the necessary configuration.  I am
> >>> >> >>> having
> >>> >> >>> difficult time to glean that information from available
> >>> >> >>> documentation.
> >>> >> >>>  I am
> >>> >> >>> sure someone has done such test before and i appreciate if you
> can
> >>> >> >>> pass on
> >>> >> >>> that information.  Secondly, I also would like to stream the
> logs
> >>> >> >>> to a
> >>> >> >>> remote server.  Is that a log4j configuration or do i need to
> run
> >>> >> >>> an
> >>> >> >>> agent
> >>> >> >>> on each host to do so?  Any configuration examples would be of
> >>> >> >>> great
> >>> >> >>> help.
> >>> >> >>>
> >>> >> >>> Thanks,
> >>> >> >>> Bhaskar
> >>> >> >>
> >>> >> >>
> >>> >> >
> >>> >
> >>> >
> >>
> >>
> >>
> >
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Bhaskar,

         That's great...And we can use "%{host}" as the escape
sequence to prefix our filenames (am I getting you correctly?). And I
am waiting anxiously for your guide as I am still a newbie. :-)
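
If I am reading the hdfs sink section of the user guide right, those
escapes go into the sink path, something along these lines (just a
sketch from memory -- the sink name, namenode URI and directory are
made up, and it assumes the events already carry a "host" header):

agent4.sinks.hdfs_sink.type = hdfs
agent4.sinks.hdfs_sink.hdfs.path = hdfs://<<namenode>>:8020/flume/events/%{host}
agent4.sinks.hdfs_sink.channel = MemoryChannel-2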

Regards,
    Mohammad Tariq


On Tue, Jun 19, 2012 at 7:16 PM, Bhaskar <bm...@gmail.com> wrote:
> Thank you guys for the responses.  I actually was able to get around this
> problem by tinkering around with my setting.  I finally ended up with a
> capacity of 10000 and commented out transactionCapacity (i originally set it
> to 10) and it started working.  Thanks for the insight.  It took me a bit of
> time to figure out the inner workings of AVRO to get it to send data in
> correct format.  So, i got over that hump:-).  Here is my flow for POC.
>
> Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
> ../conf/)
> Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
> (flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
> ../conf/)
> Host C agent --> avro-collector source --> file sink to local directory --
> Memory channel
>
> The issue i am running into is, I am unable to uniquely identify the source
> of the log in the sink (means the log events from Host A and Host B are
> combined into the same log on the disk and mixed up).  Is there a way to
> provide unique identifier from the source so that we can track the origin of
> the log?  I am hoping to see in my sink log,
>
> Host A -- some log entry
> Host B -- Some log entry  etc
>
> Is this feasible or are there any alternative mechanisms to achieve this?  I
> am putting together a new bee guide that might help answer some of these
> questions for others as i explore this architecture.
>
> As always thanks for your assistance,
> Bhaskar
>
>
> On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly
> <ju...@cyberagent.co.jp> wrote:
>>
>> Hello Bhaskar,
>>
>> Using Avro is generally the recommended method to handle multi-hop flows,
>> so no concerns there.
>>
>> Have you tried this setup using memory channels instead of jdbc? Last time
>> I tested it, the JDBC channel had poor throughput, so you may be getting a
>> logjam somewhere. How much data is getting entered into your logfile? Try
>> raising the capacity on your jdbc channel by a lot(10000?). With a capacity
>> of 10, if the reading side(host b) isn't polling frequently enough, there's
>> going to be problems. This is probably why you get the "failed to persist
>> event". As far as FLUME-1259 is concerned, that should only be happening if
>> bad data is being sent. You're not sending anything else to the same port
>> are you? Make sure that only the source and sink are set to that port and
>> that nothing else is.
>>
>> If the problem continues, please post a chunk of the logs leading up to
>> the OOM error(the full trace for the cause should be enough)
>>
>>
>>
>>
>>
>>
>> On 06/16/2012 12:01 AM, Bhaskar wrote:
>>
>> Sorry to be a pest with stream of questions.  I think i am going two steps
>> forward and four steps back:-).  After my first successful attempt, i tried
>> running flume with the following flow:
>>
>> 1.  HostA
>>   -- Source  is tail web server log
>>   -- channel jdbc
>>   -- sink is AVRO collection on Host B
>> Configuraiton:
>> agent3.sources = tail
>> agent3.sinks = avro-forward-sink
>> agent3.channels = jdbc-channel
>>
>> # Define source flow
>> agent3.sources.tail.type = exec
>> agent3.sources.tail.command = tail -f /common/log/access.log
>> agent3.sources.tail.channels = jdbc-channel
>>
>> # define the flow
>> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>>
>> # avro sink properties
>> agent3.sources.avro-forward-sink.type = avro
>> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
>> agent3.sources.avro-forward-sink.port = <<PORT>>
>>
>> # Define channels
>> agent3.channels.jdbc-channel.type = jdbc
>> agent3.channels.jdbc-channel.maximum.capacity = 10
>> agent3.channels.jdbc-channel.maximum.connections = 2
>>
>>
>> 2.  HostB
>>   -- Source is AVRO collection
>>   -- channel is memory
>>   -- sink is local file system
>>
>> Configuration:
>> # list sources, sinks and channels in the agent4
>> agent4.sources = avro-collection-source
>> agent4.sinks = svc_0_sink
>> agent4.channels = MemoryChannel-2
>>
>> # define the flow
>> agent4.sources.avro-collection-source.channels = MemoryChannel-2
>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>
>> # avro sink properties
>> agent4.sources.avro-collection-source.type = avro
>> agent4.sources.avro-collection-source.bind = <<IP Address>>
>> agent4.sources.avro-collection-source.port = <<PORT>>
>>
>> agent4.sinks.svc_0_sink.type = FILE_ROLL
>> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
>> agent4.sinks.svc_0_sink.rollInterval=600
>> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>>
>> agent4.channels.MemoryChannel-2.type = memory
>> agent4.channels.MemoryChannel-2.capacity = 100
>> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>>
>>
>> Basically i am trying to tail a file on one host, stream it to another
>> host running sink.  During the trial run, the configuration is loaded fine
>> and i see the channels created fine.  I see an exception from the jdbc
>> channel first (Failed to persist event).  I am getting a java heap space OOM
>> exception from Host B when Host A attempts to write.
>>
>> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
>> downstream.
>> java.lang.OutOfMemoryError: Java heap space
>>
>> This issue was already
>> reported https://issues.apache.org/jira/browse/FLUME-1259 but i am not sure
>> if there is a work around to this problem.  I have couple questions:
>>
>> 1.  Am i force fitting a wrong solution here using AVRO?
>> 2.  if so, what would be a right solution for streaming data from Host A
>> to Host B (or thru intermediaries)?
>>
>> Thanks,
>> Bhaskar
>>
>> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>
>> wrote:
>>>
>>> Since you are thinking of using multi-hop flow I would suggest to go
>>> for "JDBC Channel" as there is higher chance of error than single-hop
>>> flow and in JDBC Channel events are stored in a persistent storage
>>> that’s backed by a database. For detailed guidelines you can refer
>>> Flume 1.x User Guide at -
>>>
>>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>>
>>> Regards,
>>>     Mohammad Tariq
>>>
>>>
>>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
>>> > Hi Mohammad,
>>> > Thanks for the pointer there.  Do you think using a message queue (like
>>> > rabbitmq) would be a choice of communication channel between each hop?
>>> >  i am
>>> > struggling to get a handle on how i need to configure my sink in
>>> > intermediary hops in a multi-hop flow.    Appreciate any
>>> > guidance/examples.
>>> >
>>> > Thanks,
>>> > Bhaskar
>>> >
>>> >
>>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello Bhaskar,
>>> >>
>>> >>      That's great..And the best approach to stream logs depends upon
>>> >> the type of source you want to watch for..And by looking at your
>>> >> usecase, I would suggest to go for "multi-hop" flows where events
>>> >> travel through multiple agents before reaching the final destination.
>>> >>
>>> >> Regards,
>>> >>     Mohammad Tariq
>>> >>
>>> >>
>>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
>>> >> > I know what i am missing:-)  I missed connecting the sink with the
>>> >> > channel.
>>> >> >  My small POC works now and i am able to view the streamed logs.
>>> >> >  Thank
>>> >> > you
>>> >> > all for the guidance and patience in answering all questions.  So,
>>> >> > whats
>>> >> > the
>>> >> > best approach to stream logs from other hosts?  Basically my next
>>> >> > task
>>> >> > would
>>> >> > be to set up collector (sort of) model to stream logs to
>>> >> > intermediary
>>> >> > and
>>> >> > then stream from collector to a sink location.  I'd appreciate any
>>> >> > thoughts/guidance in this regard.
>>> >> >
>>> >> > Bhaskar
>>> >> >
>>> >> >
>>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>>> >> >>
>>> >> >> For testing purposes, I tried with the following configuration
>>> >> >> without
>>> >> >> much luck.  I see that the process started fine but it just does
>>> >> >> not
>>> >> >> write
>>> >> >> anything to the sink.  I guess i am missing something here.  Can
>>> >> >> one of
>>> >> >> you
>>> >> >> gurus take a look and suggest what i am doing wrong?
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Bhaskar
>>> >> >>
>>> >> >> agent1.sources = tail
>>> >> >> agent1.channels = MemoryChannel-2
>>> >> >> agent1.sinks = svc_0_sink
>>> >> >>
>>> >> >>
>>> >> >> agent1.sources.tail.type = exec
>>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>>> >> >>
>>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>>> >> >>
>>> >> >> agent1.channels.MemoryChannel-2.type = memory
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>>> >> >> <gp...@cyres.fr>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi Bhaskar,
>>> >> >>>
>>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>>> >> >>> using.
>>> >> >>> I have an avro server on the hadoop-m host and one agent per node
>>> >> >>> (slave
>>> >> >>> hosts). Each agent send the ouput of a exec command to avro
>>> >> >>> server.
>>> >> >>>
>>> >> >>> Host1 : exec -> memory -> avro (sink)
>>> >> >>>
>>> >> >>> Host2 : exec -> memory -> avro
>>> >> >>>                                                >>>>>    MainHost :
>>> >> >>> avro
>>> >> >>> (source) -> memory -> rolling file (local FS)
>>> >> >>> ...
>>> >> >>>
>>> >> >>> Host3 : exec -> memory -> avro
>>> >> >>>
>>> >> >>>
>>> >> >>> Use your own exec command to read Apache log.
>>> >> >>>
>>> >> >>> Guillaume Polaert | Cyrès Conseil
>>> >> >>>
>>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
>>> >> >>> À : flume-user@incubator.apache.org
>>> >> >>> Objet : Newbee question about flume 1.2 set up
>>> >> >>>
>>> >> >>> Good Afternoon,
>>> >> >>> I am a newbee to flume and read thru limited documentation
>>> >> >>> available.
>>> >> >>>  I
>>> >> >>> would like to set up the following to test out.
>>> >> >>>
>>> >> >>> 1.  Read apache access logs (as source)
>>> >> >>> 2.  Use memory channel
>>> >> >>> 3.  Write it to a NFS (or even local) file system
>>> >> >>>
>>> >> >>> Can some one help me with the necessary configuration.  I am
>>> >> >>> having
>>> >> >>> difficult time to glean that information from available
>>> >> >>> documentation.
>>> >> >>>  I am
>>> >> >>> sure someone has done such test before and i appreciate if you can
>>> >> >>> pass on
>>> >> >>> that information.  Secondly, I also would like to stream the logs
>>> >> >>> to a
>>> >> >>> remote server.  Is that a log4j configuration or do i need to run
>>> >> >>> an
>>> >> >>> agent
>>> >> >>> on each host to do so?  Any configuration examples would be of
>>> >> >>> great
>>> >> >>> help.
>>> >> >>>
>>> >> >>> Thanks,
>>> >> >>> Bhaskar
>>> >> >>
>>> >> >>
>>> >> >
>>> >
>>> >
>>
>>
>>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Thank you guys for the responses.  I actually was able to get around this
problem by tinkering around with my settings.  I finally ended up with a
capacity of 10000 and commented out transactionCapacity (I originally set
it to 10) and it started working.  Thanks for the insight.  It took me a
bit of time to figure out the inner workings of Avro to get it to send data
in the correct format.  So, I got over that hump :-).  Here is my flow for POC.

Host A agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
(flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
../conf/)
Host B agent --> Source tail exec --> AVRO Client Sink --> jdbc channel
(flume-ng avro-client  -H <<Host>> -p <<port>> -F <<file to read>> --conf
../conf/)
Host C agent --> avro-collector source --> file sink to local directory --
Memory channel

The issue I am running into is that I am unable to uniquely identify the source
of the log in the sink (meaning the log events from Host A and Host B are
combined into the same log on the disk and mixed up).  Is there a way to
provide a unique identifier from the source so that we can track the origin
of the log?  I am hoping to see in my sink log,

Host A -- some log entry
Host B -- Some log entry  etc

Is this feasible, or are there any alternative mechanisms to achieve this?
I am putting together a newbie guide that might help answer some of these
questions for others as I explore this architecture.
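
The only alternative mechanism I can think of so far is to give each
sending host its own source, channel and sink on the collector, so the
logs land in separate directories.  Roughly like this (untested sketch,
component names are made up):

agent4.sources = avro-src-hostA avro-src-hostB
agent4.channels = chan-hostA chan-hostB
agent4.sinks = sink-hostA sink-hostB

agent4.sources.avro-src-hostA.type = avro
agent4.sources.avro-src-hostA.bind = <<IP Address>>
agent4.sources.avro-src-hostA.port = <<PORT_A>>
agent4.sources.avro-src-hostA.channels = chan-hostA

agent4.channels.chan-hostA.type = memory

agent4.sinks.sink-hostA.type = FILE_ROLL
agent4.sinks.sink-hostA.sink.directory = /logs/agent4/hostA
agent4.sinks.sink-hostA.channel = chan-hostA

# ...and the same three stanzas again for hostB, listening on <<PORT_B>>

That feels clumsy, though, so I am hoping there is something cleaner.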

As always thanks for your assistance,
Bhaskar

On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

>  Hello Bhaskar,
>
> Using Avro is generally the recommended method to handle multi-hop flows,
> so no concerns there.
>
> Have you tried this setup using memory channels instead of jdbc? Last time
> I tested it, the JDBC channel had poor throughput, so you may be getting a
> logjam somewhere. How much data is getting entered into your logfile? Try
> raising the capacity on your jdbc channel by a lot(10000?). With a capacity
> of 10, if the reading side(host b) isn't polling frequently enough, there's
> going to be problems. This is probably why you get the "failed to persist
> event". As far as FLUME-1259 is concerned, that should only be happening if
> bad data is being sent. You're not sending anything else to the same port
> are you? Make sure that only the source and sink are set to that port and
> that nothing else is.
>
> If the problem continues, please post a chunk of the logs leading up to
> the OOM error(the full trace for the cause should be enough)
>
>
>
>
>
>
> On 06/16/2012 12:01 AM, Bhaskar wrote:
>
> Sorry to be a pest with stream of questions.  I think i am going two steps
> forward and four steps back:-).  After my first successful attempt, i tried
> running flume with the following flow:
>
>  1.  HostA
>   -- Source  is tail web server log
>   -- channel jdbc
>   -- sink is AVRO collection on Host B
> Configuraiton:
>  agent3.sources = tail
> agent3.sinks = avro-forward-sink
> agent3.channels = jdbc-channel
>
>  # Define source flow
> agent3.sources.tail.type = exec
> agent3.sources.tail.command = tail -f /common/log/access.log
> agent3.sources.tail.channels = jdbc-channel
>
>  # define the flow
> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>
>  # avro sink properties
> agent3.sources.avro-forward-sink.type = avro
> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> agent3.sources.avro-forward-sink.port = <<PORT>>
>
>  # Define channels
> agent3.channels.jdbc-channel.type = jdbc
> agent3.channels.jdbc-channel.maximum.capacity = 10
> agent3.channels.jdbc-channel.maximum.connections = 2
>
>
>  2.  HostB
>   -- Source is AVRO collection
>   -- channel is memory
>   -- sink is local file system
>
>  Configuration:
>  # list sources, sinks and channels in the agent4
> agent4.sources = avro-collection-source
> agent4.sinks = svc_0_sink
> agent4.channels = MemoryChannel-2
>
>  # define the flow
> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
>  # avro sink properties
> agent4.sources.avro-collection-source.type = avro
> agent4.sources.avro-collection-source.bind = <<IP Address>>
> agent4.sources.avro-collection-source.port = <<PORT>>
>
>  agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> agent4.sinks.svc_0_sink.rollInterval=600
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
>  agent4.channels.MemoryChannel-2.type = memory
> agent4.channels.MemoryChannel-2.capacity = 100
> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>
>
>  Basically i am trying to tail a file on one host, stream it to another
> host running sink.  During the trial run, the configuration is loaded fine
> and i see the channels created fine.  I see an exception from the jdbc
> channel first (Failed to persist event).  I am getting a java heap space
> OOM exception from Host B when Host A attempts to write.
>
>  2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
> downstream.
> java.lang.OutOfMemoryError: Java heap space
>
>  This issue was already reported
> https://issues.apache.org/jira/browse/FLUME-1259 but i am not sure if
> there is a work around to this problem.  I have couple questions:
>
>  1.  Am i force fitting a wrong solution here using AVRO?
> 2.  if so, what would be a right solution for streaming data from Host A
> to Host B (or thru intermediaries)?
>
>  Thanks,
> Bhaskar
>
> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com>wrote:
>
>> Since you are thinking of using multi-hop flow I would suggest to go
>> for "JDBC Channel" as there is higher chance of error than single-hop
>> flow and in JDBC Channel events are stored in a persistent storage
>> that’s backed by a database. For detailed guidelines you can refer
>> Flume 1.x User Guide at -
>>
>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
>> > Hi Mohammad,
>> > Thanks for the pointer there.  Do you think using a message queue (like
>> > rabbitmq) would be a choice of communication channel between each hop?
>>  i am
>> > struggling to get a handle on how i need to configure my sink in
>> > intermediary hops in a multi-hop flow.    Appreciate any
>> guidance/examples.
>> >
>> > Thanks,
>> > Bhaskar
>> >
>> >
>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
>> wrote:
>> >>
>> >> Hello Bhaskar,
>> >>
>> >>      That's great..And the best approach to stream logs depends upon
>> >> the type of source you want to watch for..And by looking at your
>> >> usecase, I would suggest to go for "multi-hop" flows where events
>> >> travel through multiple agents before reaching the final destination.
>> >>
>> >> Regards,
>> >>     Mohammad Tariq
>> >>
>> >>
>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
>> >> > I know what i am missing:-)  I missed connecting the sink with the
>> >> > channel.
>> >> >  My small POC works now and i am able to view the streamed logs.
>>  Thank
>> >> > you
>> >> > all for the guidance and patience in answering all questions.  So,
>> whats
>> >> > the
>> >> > best approach to stream logs from other hosts?  Basically my next
>> task
>> >> > would
>> >> > be to set up collector (sort of) model to stream logs to intermediary
>> >> > and
>> >> > then stream from collector to a sink location.  I'd appreciate any
>> >> > thoughts/guidance in this regard.
>> >> >
>> >> > Bhaskar
>> >> >
>> >> >
>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>> >> >>
>> >> >> For testing purposes, I tried with the following configuration
>> without
>> >> >> much luck.  I see that the process started fine but it just does not
>> >> >> write
>> >> >> anything to the sink.  I guess i am missing something here.  Can
>> one of
>> >> >> you
>> >> >> gurus take a look and suggest what i am doing wrong?
>> >> >>
>> >> >> Thanks,
>> >> >> Bhaskar
>> >> >>
>> >> >> agent1.sources = tail
>> >> >> agent1.channels = MemoryChannel-2
>> >> >> agent1.sinks = svc_0_sink
>> >> >>
>> >> >>
>> >> >> agent1.sources.tail.type = exec
>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>> >> >>
>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >> >>
>> >> >> agent1.channels.MemoryChannel-2.type = memory
>> >> >>
>> >> >>
>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <
>> gpolaert@cyres.fr>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi Bhaskar,
>> >> >>>
>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
>> using.
>> >> >>> I have an avro server on the hadoop-m host and one agent per node
>> >> >>> (slave
>> >> >>> hosts). Each agent send the ouput of a exec command to avro server.
>> >> >>>
>> >> >>> Host1 : exec -> memory -> avro (sink)
>> >> >>>
>> >> >>> Host2 : exec -> memory -> avro
>> >> >>>                                                >>>>>    MainHost :
>> >> >>> avro
>> >> >>> (source) -> memory -> rolling file (local FS)
>> >> >>> ...
>> >> >>>
>> >> >>> Host3 : exec -> memory -> avro
>> >> >>>
>> >> >>>
>> >> >>> Use your own exec command to read Apache log.
>> >> >>>
>> >> >>> Guillaume Polaert | Cyrès Conseil
>> >> >>>
>> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> >> >>> Envoyé : mercredi 13 juin 2012 19:16
>> >> >>> À : flume-user@incubator.apache.org
>> >> >>> Objet : Newbee question about flume 1.2 set up
>> >> >>>
>> >> >>> Good Afternoon,
>> >> >>> I am a newbee to flume and read thru limited documentation
>> available.
>> >> >>>  I
>> >> >>> would like to set up the following to test out.
>> >> >>>
>> >> >>> 1.  Read apache access logs (as source)
>> >> >>> 2.  Use memory channel
>> >> >>> 3.  Write it to a NFS (or even local) file system
>> >> >>>
>> >> >>> Can some one help me with the necessary configuration.  I am having
>> >> >>> difficult time to glean that information from available
>> documentation.
>> >> >>>  I am
>> >> >>> sure someone has done such test before and i appreciate if you can
>> >> >>> pass on
>> >> >>> that information.  Secondly, I also would like to stream the logs
>> to a
>> >> >>> remote server.  Is that a log4j configuration or do i need to run
>> an
>> >> >>> agent
>> >> >>> on each host to do so?  Any configuration examples would be of
>> great
>> >> >>> help.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Bhaskar
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>>
>
>
>

Re: Newbee question about flume 1.2 set up

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
Hello Bhaskar,

Using Avro is generally the recommended method to handle multi-hop 
flows, so no concerns there.

Have you tried this setup using memory channels instead of jdbc? Last 
time I tested it, the JDBC channel had poor throughput, so you may be 
getting a logjam somewhere. How much data is getting entered into your 
logfile? Try raising the capacity on your jdbc channel by a lot (10000?). 
With a capacity of 10, if the reading side (host B) isn't polling 
frequently enough, there are going to be problems. This is probably why 
you get the "failed to persist event". As far as FLUME-1259 is 
concerned, that should only be happening if bad data is being sent. 
You're not sending anything else to the same port, are you? Make sure 
that only the source and sink are set to that port and that nothing else is.
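
For example, reusing your property names (just a sketch, adjust to
taste -- the memory-channel variant keeps your channel name only for
brevity):

agent3.channels.jdbc-channel.maximum.capacity = 10000

# or, to rule the jdbc channel out entirely, swap it for a memory channel:
# agent3.channels.jdbc-channel.type = memory
# agent3.channels.jdbc-channel.capacity = 10000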

If the problem continues, please post a chunk of the logs leading up to 
the OOM error (the full trace for the cause should be enough).





On 06/16/2012 12:01 AM, Bhaskar wrote:
> Sorry to be a pest with stream of questions.  I think i am going two 
> steps forward and four steps back:-).  After my first successful 
> attempt, i tried running flume with the following flow:
>
> 1.  HostA
>   -- Source  is tail web server log
>   -- channel jdbc
>   -- sink is AVRO collection on Host B
> Configuraiton:
> agent3.sources = tail
> agent3.sinks = avro-forward-sink
> agent3.channels = jdbc-channel
>
> # Define source flow
> agent3.sources.tail.type = exec
> agent3.sources.tail.command = tail -f /common/log/access.log
> agent3.sources.tail.channels = jdbc-channel
>
> # define the flow
> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>
> # avro sink properties
> agent3.sources.avro-forward-sink.type = avro
> agent3.sources.avro-forward-sink.hostname = <<IP Address>>
> agent3.sources.avro-forward-sink.port = <<PORT>>
>
> # Define channels
> agent3.channels.jdbc-channel.type = jdbc
> agent3.channels.jdbc-channel.maximum.capacity = 10
> agent3.channels.jdbc-channel.maximum.connections = 2
>
>
> 2.  HostB
>   -- Source is AVRO collection
>   -- channel is memory
>   -- sink is local file system
>
> Configuration:
> # list sources, sinks and channels in the agent4
> agent4.sources = avro-collection-source
> agent4.sinks = svc_0_sink
> agent4.channels = MemoryChannel-2
>
> # define the flow
> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> # avro sink properties
> agent4.sources.avro-collection-source.type = avro
> agent4.sources.avro-collection-source.bind = <<IP Address>>
> agent4.sources.avro-collection-source.port = <<PORT>>
>
> agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
> agent4.sinks.svc_0_sink.rollInterval=600
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> agent4.channels.MemoryChannel-2.type = memory
> agent4.channels.MemoryChannel-2.capacity = 100
> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>
>
> Basically i am trying to tail a file on one host, stream it to another 
> host running sink.  During the trial run, the configuration is loaded 
> fine and i see the channels created fine.  I see an exception from the 
> jdbc channel first (Failed to persist event).  I am getting a java 
> heap space OOM exception from Host B when Host A attempts to write.
>
> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception 
> from downstream.
> java.lang.OutOfMemoryError: Java heap space
>
> This issue was already reported 
> https://issues.apache.org/jira/browse/FLUME-1259 but i am not sure if 
> there is a work around to this problem.  I have couple questions:
>
> 1.  Am i force fitting a wrong solution here using AVRO?
> 2.  if so, what would be a right solution for streaming data from Host 
> A to Host B (or thru intermediaries)?
>
> Thanks,
> Bhaskar
>
> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <dontariq@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Since you are thinking of using multi-hop flow I would suggest to go
>     for "JDBC Channel" as there is higher chance of error than single-hop
>     flow and in JDBC Channel events are stored in a persistent storage
>     that’s backed by a database. For detailed guidelines you can refer
>     Flume 1.x User Guide at -
>     https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>     <https://people.apache.org/%7Empercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html>
>
>     Regards,
>         Mohammad Tariq
>
>
>     On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmarthi@gmail.com
>     <ma...@gmail.com>> wrote:
>     > Hi Mohammad,
>     > Thanks for the pointer there.  Do you think using a message
>     queue (like
>     > rabbitmq) would be a choice of communication channel between
>     each hop?  i am
>     > struggling to get a handle on how i need to configure my sink in
>     > intermediary hops in a multi-hop flow.    Appreciate any
>     guidance/examples.
>     >
>     > Thanks,
>     > Bhaskar
>     >
>     >
>     > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq
>     <dontariq@gmail.com <ma...@gmail.com>> wrote:
>     >>
>     >> Hello Bhaskar,
>     >>
>     >>      That's great..And the best approach to stream logs depends
>     upon
>     >> the type of source you want to watch for..And by looking at your
>     >> usecase, I would suggest to go for "multi-hop" flows where events
>     >> travel through multiple agents before reaching the final
>     destination.
>     >>
>     >> Regards,
>     >>     Mohammad Tariq
>     >>
>     >>
>     >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmarthi@gmail.com
>     <ma...@gmail.com>> wrote:
>     >> > I know what i am missing:-)  I missed connecting the sink
>     with the
>     >> > channel.
>     >> >  My small POC works now and i am able to view the streamed
>     logs.  Thank
>     >> > you
>     >> > all for the guidance and patience in answering all questions.
>      So, whats
>     >> > the
>     >> > best approach to stream logs from other hosts?  Basically my
>     next task
>     >> > would
>     >> > be to set up collector (sort of) model to stream logs to
>     intermediary
>     >> > and
>     >> > then stream from collector to a sink location.  I'd
>     appreciate any
>     >> > thoughts/guidance in this regard.
>     >> >
>     >> > Bhaskar
>     >> >
>     >> >
>     >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmarthi@gmail.com
>     <ma...@gmail.com>> wrote:
>     >> >>
>     >> >> For testing purposes, I tried with the following
>     configuration without
>     >> >> much luck.  I see that the process started fine but it just
>     does not
>     >> >> write
>     >> >> anything to the sink.  I guess i am missing something here.
>      Can one of
>     >> >> you
>     >> >> gurus take a look and suggest what i am doing wrong?
>     >> >>
>     >> >> Thanks,
>     >> >> Bhaskar
>     >> >>
>     >> >> agent1.sources = tail
>     >> >> agent1.channels = MemoryChannel-2
>     >> >> agent1.sinks = svc_0_sink
>     >> >>
>     >> >>
>     >> >> agent1.sources.tail.type = exec
>     >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>     >> >> agent1.sources.tail.channels = MemoryChannel-2
>     >> >>
>     >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>     >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>     >> >> agent1.sinks.svc_0_sink.rollInterval=0
>     >> >>
>     >> >> agent1.channels.MemoryChannel-2.type = memory
>     >> >>
>     >> >>
>     >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert
>     <gpolaert@cyres.fr <ma...@cyres.fr>>
>     >> >> wrote:
>     >> >>>
>     >> >>> Hi Bhaskar,
>     >> >>>
>     >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what
>     I'm using.
>     >> >>> I have an avro server on the hadoop-m host and one agent
>     per node
>     >> >>> (slave
>     >> >>> hosts). Each agent send the ouput of a exec command to avro
>     server.
>     >> >>>
>     >> >>> Host1 : exec -> memory -> avro (sink)
>     >> >>>
>     >> >>> Host2 : exec -> memory -> avro
>     >> >>> >>>>>    MainHost :
>     >> >>> avro
>     >> >>> (source) -> memory -> rolling file (local FS)
>     >> >>> ...
>     >> >>>
>     >> >>> Host3 : exec -> memory -> avro
>     >> >>>
>     >> >>>
>     >> >>> Use your own exec command to read Apache log.
>     >> >>>
>     >> >>> Guillaume Polaert | Cyrès Conseil
>     >> >>>
>     >> >>> De : Bhaskar [mailto:bmarthi@gmail.com
>     <ma...@gmail.com>]
>     >> >>> Envoyé : mercredi 13 juin 2012 19:16
>     >> >>> À : flume-user@incubator.apache.org
>     <ma...@incubator.apache.org>
>     >> >>> Objet : Newbee question about flume 1.2 set up
>     >> >>>
>     >> >>> Good Afternoon,
>     >> >>> I am a newbee to flume and read thru limited documentation
>     available.
>     >> >>>  I
>     >> >>> would like to set up the following to test out.
>     >> >>>
>     >> >>> 1.  Read apache access logs (as source)
>     >> >>> 2.  Use memory channel
>     >> >>> 3.  Write it to a NFS (or even local) file system
>     >> >>>
>     >> >>> Can some one help me with the necessary configuration.  I
>     am having
>     >> >>> difficult time to glean that information from available
>     documentation.
>     >> >>>  I am
>     >> >>> sure someone has done such test before and i appreciate if
>     you can
>     >> >>> pass on
>     >> >>> that information.  Secondly, I also would like to stream
>     the logs to a
>     >> >>> remote server.  Is that a log4j configuration or do i need
>     to run an
>     >> >>> agent
>     >> >>> on each host to do so?  Any configuration examples would be
>     of great
>     >> >>> help.
>     >> >>>
>     >> >>> Thanks,
>     >> >>> Bhaskar
>     >> >>
>     >> >>
>     >> >
>     >
>     >
>
>


Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Sorry to be a pest with a stream of questions.  I think I am going two steps
forward and four steps back :-).  After my first successful attempt, I tried
running Flume with the following flow:

1.  HostA
  -- Source  is tail web server log
  -- channel jdbc
  -- sink is AVRO collection on Host B
Configuration:
agent3.sources = tail
agent3.sinks = avro-forward-sink
agent3.channels = jdbc-channel

# Define source flow
agent3.sources.tail.type = exec
agent3.sources.tail.command = tail -f /common/log/access.log
agent3.sources.tail.channels = jdbc-channel

# define the flow
agent3.sinks.avro-forward-sink.channel = jdbc-channel

# avro sink properties
agent3.sources.avro-forward-sink.type = avro
agent3.sources.avro-forward-sink.hostname = <<IP Address>>
agent3.sources.avro-forward-sink.port = <<PORT>>

# Define channels
agent3.channels.jdbc-channel.type = jdbc
agent3.channels.jdbc-channel.maximum.capacity = 10
agent3.channels.jdbc-channel.maximum.connections = 2


2.  HostB
  -- Source is AVRO collection
  -- channel is memory
  -- sink is local file system

Configuration:
# list sources, sinks and channels in the agent4
agent4.sources = avro-collection-source
agent4.sinks = svc_0_sink
agent4.channels = MemoryChannel-2

# define the flow
agent4.sources.avro-collection-source.channels = MemoryChannel-2
agent4.sinks.svc_0_sink.channel = MemoryChannel-2

# avro sink properties
agent4.sources.avro-collection-source.type = avro
agent4.sources.avro-collection-source.bind = <<IP Address>>
agent4.sources.avro-collection-source.port = <<PORT>>

agent4.sinks.svc_0_sink.type = FILE_ROLL
agent4.sinks.svc_0_sink.sink.directory=/logs/agent4
agent4.sinks.svc_0_sink.rollInterval=600
agent4.sinks.svc_0_sink.channel = MemoryChannel-2

agent4.channels.MemoryChannel-2.type = memory
agent4.channels.MemoryChannel-2.capacity = 100
agent4.channels.MemoryChannel-2.transactionCapacity = 10


Basically I am trying to tail a file on one host and stream it to another host
running the sink.  During the trial run, the configuration is loaded fine and I
see the channels created fine.  I see an exception from the jdbc channel
first (Failed to persist event).  I am getting a Java heap space OOM
exception from Host B when Host A attempts to write.

2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from
downstream.
java.lang.OutOfMemoryError: Java heap space

This issue was already reported in
https://issues.apache.org/jira/browse/FLUME-1259 but I am not sure if there
is a workaround to this problem.  I have a couple of questions:

1.  Am I force-fitting the wrong solution here by using Avro?
2.  If so, what would be the right solution for streaming data from Host A to
Host B (or through intermediaries)?

Thanks,
Bhaskar

On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Since you are thinking of using multi-hop flow I would suggest to go
> for "JDBC Channel" as there is higher chance of error than single-hop
> flow and in JDBC Channel events are stored in a persistent storage
> that’s backed by a database. For detailed guidelines you can refer
> Flume 1.x User Guide at -
>
> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>
> Regards,
>     Mohammad Tariq
>
>
> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
> > Hi Mohammad,
> > Thanks for the pointer there.  Do you think using a message queue (like
> > rabbitmq) would be a choice of communication channel between each hop?
>  i am
> > struggling to get a handle on how i need to configure my sink in
> > intermediary hops in a multi-hop flow.    Appreciate any
> guidance/examples.
> >
> > Thanks,
> > Bhaskar
> >
> >
> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
> >>
> >> Hello Bhaskar,
> >>
> >>      That's great..And the best approach to stream logs depends upon
> >> the type of source you want to watch for..And by looking at your
> >> usecase, I would suggest to go for "multi-hop" flows where events
> >> travel through multiple agents before reaching the final destination.
> >>
> >> Regards,
> >>     Mohammad Tariq
> >>
> >>
> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
> >> > I know what i am missing:-)  I missed connecting the sink with the
> >> > channel.
> >> >  My small POC works now and i am able to view the streamed logs.
>  Thank
> >> > you
> >> > all for the guidance and patience in answering all questions.  So,
> whats
> >> > the
> >> > best approach to stream logs from other hosts?  Basically my next task
> >> > would
> >> > be to set up collector (sort of) model to stream logs to intermediary
> >> > and
> >> > then stream from collector to a sink location.  I'd appreciate any
> >> > thoughts/guidance in this regard.
> >> >
> >> > Bhaskar
> >> >
> >> >
> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
> >> >>
> >> >> For testing purposes, I tried with the following configuration
> without
> >> >> much luck.  I see that the process started fine but it just does not
> >> >> write
> >> >> anything to the sink.  I guess i am missing something here.  Can one
> of
> >> >> you
> >> >> gurus take a look and suggest what i am doing wrong?
> >> >>
> >> >> Thanks,
> >> >> Bhaskar
> >> >>
> >> >> agent1.sources = tail
> >> >> agent1.channels = MemoryChannel-2
> >> >> agent1.sinks = svc_0_sink
> >> >>
> >> >>
> >> >> agent1.sources.tail.type = exec
> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
> >> >> agent1.sources.tail.channels = MemoryChannel-2
> >> >>
> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> >> >> agent1.sinks.svc_0_sink.rollInterval=0
> >> >>
> >> >> agent1.channels.MemoryChannel-2.type = memory
> >> >>
> >> >>
> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <
> gpolaert@cyres.fr>
> >> >> wrote:
> >> >>>
> >> >>> Hi Bhaskar,
> >> >>>
> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm
> using.
> >> >>> I have an avro server on the hadoop-m host and one agent per node
> >> >>> (slave
> >> >>> hosts). Each agent send the ouput of a exec command to avro server.
> >> >>>
> >> >>> Host1 : exec -> memory -> avro (sink)
> >> >>>
> >> >>> Host2 : exec -> memory -> avro
> >> >>>                                                >>>>>    MainHost :
> >> >>> avro
> >> >>> (source) -> memory -> rolling file (local FS)
> >> >>> ...
> >> >>>
> >> >>> Host3 : exec -> memory -> avro
> >> >>>
> >> >>>
> >> >>> Use your own exec command to read Apache log.
> >> >>>
> >> >>> Guillaume Polaert | Cyrès Conseil
> >> >>>
> >> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
> >> >>> Envoyé : mercredi 13 juin 2012 19:16
> >> >>> À : flume-user@incubator.apache.org
> >> >>> Objet : Newbee question about flume 1.2 set up
> >> >>>
> >> >>> Good Afternoon,
> >> >>> I am a newbee to flume and read thru limited documentation
> available.
> >> >>>  I
> >> >>> would like to set up the following to test out.
> >> >>>
> >> >>> 1.  Read apache access logs (as source)
> >> >>> 2.  Use memory channel
> >> >>> 3.  Write it to a NFS (or even local) file system
> >> >>>
> >> >>> Can some one help me with the necessary configuration.  I am having
> >> >>> difficult time to glean that information from available
> documentation.
> >> >>>  I am
> >> >>> sure someone has done such test before and i appreciate if you can
> >> >>> pass on
> >> >>> that information.  Secondly, I also would like to stream the logs
> to a
> >> >>> remote server.  Is that a log4j configuration or do i need to run an
> >> >>> agent
> >> >>> on each host to do so?  Any configuration examples would be of great
> >> >>> help.
> >> >>>
> >> >>> Thanks,
> >> >>> Bhaskar
> >> >>
> >> >>
> >> >
> >
> >
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Since you are thinking of using a multi-hop flow I would suggest going
for the "JDBC Channel", as there is a higher chance of error than in a
single-hop flow, and in the JDBC Channel events are stored in persistent
storage that's backed by a database. For detailed guidelines you can refer
to the Flume 1.x User Guide at -
https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
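
A minimal JDBC Channel stanza looks something like this (a sketch from
memory, where "agentX" stands in for your agent name -- please check
the guide for the exact tuning properties):

agentX.channels = jdbc-channel
agentX.channels.jdbc-channel.type = jdbc
# the capacity can be raised from the default if the downstream sink drains slowly
agentX.channels.jdbc-channel.maximum.capacity = 10000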

Regards,
    Mohammad Tariq


On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bm...@gmail.com> wrote:
> Hi Mohammad,
> Thanks for the pointer there.  Do you think using a message queue (like
> rabbitmq) would be a choice of communication channel between each hop?  i am
> struggling to get a handle on how i need to configure my sink in
> intermediary hops in a multi-hop flow.    Appreciate any guidance/examples.
>
> Thanks,
> Bhaskar
>
>
> On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> Hello Bhaskar,
>>
>>      That's great..And the best approach to stream logs depends upon
>> the type of source you want to watch for..And by looking at your
>> usecase, I would suggest to go for "multi-hop" flows where events
>> travel through multiple agents before reaching the final destination.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
>> > I know what i am missing:-)  I missed connecting the sink with the
>> > channel.
>> >  My small POC works now and i am able to view the streamed logs.  Thank
>> > you
>> > all for the guidance and patience in answering all questions.  So, whats
>> > the
>> > best approach to stream logs from other hosts?  Basically my next task
>> > would
>> > be to set up collector (sort of) model to stream logs to intermediary
>> > and
>> > then stream from collector to a sink location.  I'd appreciate any
>> > thoughts/guidance in this regard.
>> >
>> > Bhaskar
>> >
>> >
>> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>> >>
>> >> For testing purposes, I tried with the following configuration without
>> >> much luck.  I see that the process started fine but it just does not
>> >> write
>> >> anything to the sink.  I guess i am missing something here.  Can one of
>> >> you
>> >> gurus take a look and suggest what i am doing wrong?
>> >>
>> >> Thanks,
>> >> Bhaskar
>> >>
>> >> agent1.sources = tail
>> >> agent1.channels = MemoryChannel-2
>> >> agent1.sinks = svc_0_sink
>> >>
>> >>
>> >> agent1.sources.tail.type = exec
>> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >> agent1.sources.tail.channels = MemoryChannel-2
>> >>
>> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >>
>> >> agent1.channels.MemoryChannel-2.type = memory
>> >>
>> >>
>> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>
>> >> wrote:
>> >>>
>> >>> Hi Bhaskar,
>> >>>
>> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm using.
>> >>> I have an avro server on the hadoop-m host and one agent per node
>> >>> (slave
>> >>> hosts). Each agent send the ouput of a exec command to avro server.
>> >>>
>> >>> Host1 : exec -> memory -> avro (sink)
>> >>>
>> >>> Host2 : exec -> memory -> avro
>> >>>                                                >>>>>    MainHost :
>> >>> avro
>> >>> (source) -> memory -> rolling file (local FS)
>> >>> ...
>> >>>
>> >>> Host3 : exec -> memory -> avro
>> >>>
>> >>>
>> >>> Use your own exec command to read Apache log.
>> >>>
>> >>> Guillaume Polaert | Cyrès Conseil
>> >>>
>> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> >>> Envoyé : mercredi 13 juin 2012 19:16
>> >>> À : flume-user@incubator.apache.org
>> >>> Objet : Newbee question about flume 1.2 set up
>> >>>
>> >>> Good Afternoon,
>> >>> I am a newbee to flume and read thru limited documentation available.
>> >>>  I
>> >>> would like to set up the following to test out.
>> >>>
>> >>> 1.  Read apache access logs (as source)
>> >>> 2.  Use memory channel
>> >>> 3.  Write it to a NFS (or even local) file system
>> >>>
>> >>> Can some one help me with the necessary configuration.  I am having
>> >>> difficult time to glean that information from available documentation.
>> >>>  I am
>> >>> sure someone has done such test before and i appreciate if you can
>> >>> pass on
>> >>> that information.  Secondly, I also would like to stream the logs to a
>> >>> remote server.  Is that a log4j configuration or do i need to run an
>> >>> agent
>> >>> on each host to do so?  Any configuration examples would be of great
>> >>> help.
>> >>>
>> >>> Thanks,
>> >>> Bhaskar
>> >>
>> >>
>> >
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Hi Mohammad,
Thanks for the pointer there.  Do you think using a message queue (like
rabbitmq) would be a good choice of communication channel between each hop?  I
am struggling to get a handle on how I need to configure my sink in the
intermediary hops of a multi-hop flow.  I'd appreciate any guidance/examples.

Thanks,
Bhaskar

On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Bhaskar,
>
>      That's great..And the best approach to stream logs depends upon
> the type of source you want to watch for..And by looking at your
> usecase, I would suggest to go for "multi-hop" flows where events
> travel through multiple agents before reaching the final destination.
>
> Regards,
>     Mohammad Tariq
>
>
> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
> > I know what i am missing:-)  I missed connecting the sink with the
> channel.
> >  My small POC works now and i am able to view the streamed logs.  Thank
> you
> > all for the guidance and patience in answering all questions.  So, whats
> the
> > best approach to stream logs from other hosts?  Basically my next task
> would
> > be to set up collector (sort of) model to stream logs to intermediary and
> > then stream from collector to a sink location.  I'd appreciate any
> > thoughts/guidance in this regard.
> >
> > Bhaskar
> >
> >
> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
> >>
> >> For testing purposes, I tried with the following configuration without
> >> much luck.  I see that the process started fine but it just does not
> write
> >> anything to the sink.  I guess i am missing something here.  Can one of
> you
> >> gurus take a look and suggest what i am doing wrong?
> >>
> >> Thanks,
> >> Bhaskar
> >>
> >> agent1.sources = tail
> >> agent1.channels = MemoryChannel-2
> >> agent1.sinks = svc_0_sink
> >>
> >>
> >> agent1.sources.tail.type = exec
> >> agent1.sources.tail.command = tail -f /var/log/access.log
> >> agent1.sources.tail.channels = MemoryChannel-2
> >>
> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> >> agent1.sinks.svc_0_sink.rollInterval=0
> >>
> >> agent1.channels.MemoryChannel-2.type = memory
> >>
> >>
> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>
> >> wrote:
> >>>
> >>> Hi Bhaskar,
> >>>
> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm using.
> >>> I have an avro server on the hadoop-m host and one agent per node
> (slave
> >>> hosts). Each agent send the ouput of a exec command to avro server.
> >>>
> >>> Host1 : exec -> memory -> avro (sink)
> >>>
> >>> Host2 : exec -> memory -> avro
> >>>                                                >>>>>    MainHost : avro
> >>> (source) -> memory -> rolling file (local FS)
> >>> ...
> >>>
> >>> Host3 : exec -> memory -> avro
> >>>
> >>>
> >>> Use your own exec command to read Apache log.
> >>>
> >>> Guillaume Polaert | Cyrès Conseil
> >>>
> >>> De : Bhaskar [mailto:bmarthi@gmail.com]
> >>> Envoyé : mercredi 13 juin 2012 19:16
> >>> À : flume-user@incubator.apache.org
> >>> Objet : Newbee question about flume 1.2 set up
> >>>
> >>> Good Afternoon,
> >>> I am a newbee to flume and read thru limited documentation available.
>  I
> >>> would like to set up the following to test out.
> >>>
> >>> 1.  Read apache access logs (as source)
> >>> 2.  Use memory channel
> >>> 3.  Write it to a NFS (or even local) file system
> >>>
> >>> Can some one help me with the necessary configuration.  I am having
> >>> difficult time to glean that information from available documentation.
>  I am
> >>> sure someone has done such test before and i appreciate if you can
> pass on
> >>> that information.  Secondly, I also would like to stream the logs to a
> >>> remote server.  Is that a log4j configuration or do i need to run an
> agent
> >>> on each host to do so?  Any configuration examples would be of great
> help.
> >>>
> >>> Thanks,
> >>> Bhaskar
> >>
> >>
> >
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Bhaskar,

      That's great. The best approach to stream logs depends upon the
type of source you want to watch. Looking at your use case, I would
suggest going for "multi-hop" flows, where events travel through
multiple agents before reaching the final destination.

Regards,
    Mohammad Tariq
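
As an illustration of such a multi-hop flow (the agent name, port, and host
name below are only placeholders, not values taken from this thread), an
intermediary hop is simply an agent whose source and sink are both avro: it
receives events from the previous hop and forwards them to the next one.

collector.sources = avroIn
collector.channels = mem
collector.sinks = avroOut

# receive events from the upstream agent(s)
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.channels = mem

# forward them to the next hop (or to the final destination)
collector.sinks.avroOut.type = avro
collector.sinks.avroOut.hostname = next-hop.example.com
collector.sinks.avroOut.port = 4141
collector.sinks.avroOut.channel = mem

collector.channels.mem.type = memory

The first agent in the chain points its avro sink at this collector, and the
last hop replaces the avro sink with a terminal sink such as FILE_ROLL or
hdfs.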


On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bm...@gmail.com> wrote:
> I know what i am missing:-)  I missed connecting the sink with the channel.
>  My small POC works now and i am able to view the streamed logs.  Thank you
> all for the guidance and patience in answering all questions.  So, whats the
> best approach to stream logs from other hosts?  Basically my next task would
> be to set up collector (sort of) model to stream logs to intermediary and
> then stream from collector to a sink location.  I'd appreciate any
> thoughts/guidance in this regard.
>
> Bhaskar
>
>
> On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:
>>
>> For testing purposes, I tried with the following configuration without
>> much luck.  I see that the process started fine but it just does not write
>> anything to the sink.  I guess i am missing something here.  Can one of you
>> gurus take a look and suggest what i am doing wrong?
>>
>> Thanks,
>> Bhaskar
>>
>> agent1.sources = tail
>> agent1.channels = MemoryChannel-2
>> agent1.sinks = svc_0_sink
>>
>>
>> agent1.sources.tail.type = exec
>> agent1.sources.tail.command = tail -f /var/log/access.log
>> agent1.sources.tail.channels = MemoryChannel-2
>>
>> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> agent1.sinks.svc_0_sink.rollInterval=0
>>
>> agent1.channels.MemoryChannel-2.type = memory
>>
>>
>> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>
>> wrote:
>>>
>>> Hi Bhaskar,
>>>
>>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm using.
>>> I have an avro server on the hadoop-m host and one agent per node (slave
>>> hosts). Each agent send the ouput of a exec command to avro server.
>>>
>>> Host1 : exec -> memory -> avro (sink)
>>>
>>> Host2 : exec -> memory -> avro
>>>                                                >>>>>    MainHost : avro
>>> (source) -> memory -> rolling file (local FS)
>>> ...
>>>
>>> Host3 : exec -> memory -> avro
>>>
>>>
>>> Use your own exec command to read Apache log.
>>>
>>> Guillaume Polaert | Cyrès Conseil
>>>
>>> De : Bhaskar [mailto:bmarthi@gmail.com]
>>> Envoyé : mercredi 13 juin 2012 19:16
>>> À : flume-user@incubator.apache.org
>>> Objet : Newbee question about flume 1.2 set up
>>>
>>> Good Afternoon,
>>> I am a newbee to flume and read thru limited documentation available.  I
>>> would like to set up the following to test out.
>>>
>>> 1.  Read apache access logs (as source)
>>> 2.  Use memory channel
>>> 3.  Write it to a NFS (or even local) file system
>>>
>>> Can some one help me with the necessary configuration.  I am having
>>> difficult time to glean that information from available documentation.  I am
>>> sure someone has done such test before and i appreciate if you can pass on
>>> that information.  Secondly, I also would like to stream the logs to a
>>> remote server.  Is that a log4j configuration or do i need to run an agent
>>> on each host to do so?  Any configuration examples would be of great help.
>>>
>>> Thanks,
>>> Bhaskar
>>
>>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
I know what I was missing :-)  I had missed connecting the sink to the
channel.  My small POC works now and I am able to view the streamed logs.
Thank you all for the guidance and patience in answering all my questions.
So, what's the best approach to stream logs from other hosts?  Basically,
my next task would be to set up a collector (of sorts) model: stream logs
to an intermediary, and then from the collector to a sink location.  I'd
appreciate any thoughts/guidance in this regard.

Bhaskar

On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bm...@gmail.com> wrote:

> For testing purposes, I tried with the following configuration without
> much luck.  I see that the process started fine but it just does not write
> anything to the sink.  I guess i am missing something here.  Can one of you
> gurus take a look and suggest what i am doing wrong?
>
> Thanks,
> Bhaskar
>
> agent1.sources = tail
> agent1.channels = MemoryChannel-2
> agent1.sinks = svc_0_sink
>
>
> agent1.sources.tail.type = exec
> agent1.sources.tail.command = tail -f /var/log/access.log
> agent1.sources.tail.channels = MemoryChannel-2
>
> agent1.sinks.svc_0_sink.type = FILE_ROLL
> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
> agent1.sinks.svc_0_sink.rollInterval=0
>
> agent1.channels.MemoryChannel-2.type = memory
>
>
> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>wrote:
>
>> Hi Bhaskar,
>>
>> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm using.
>> I have an avro server on the hadoop-m host and one agent per node (slave
>> hosts). Each agent send the ouput of a exec command to avro server.
>>
>> Host1 : exec -> memory -> avro (sink)
>>
>> Host2 : exec -> memory -> avro
>>                                                >>>>>    MainHost : avro
>> (source) -> memory -> rolling file (local FS)
>> ...
>>
>> Host3 : exec -> memory -> avro
>>
>>
>> Use your own exec command to read Apache log.
>>
>> Guillaume Polaert | Cyrès Conseil
>>
>> De : Bhaskar [mailto:bmarthi@gmail.com]
>> Envoyé : mercredi 13 juin 2012 19:16
>> À : flume-user@incubator.apache.org
>> Objet : Newbee question about flume 1.2 set up
>>
>> Good Afternoon,
>> I am a newbee to flume and read thru limited documentation available.  I
>> would like to set up the following to test out.
>>
>> 1.  Read apache access logs (as source)
>> 2.  Use memory channel
>> 3.  Write it to a NFS (or even local) file system
>>
>> Can some one help me with the necessary configuration.  I am having
>> difficult time to glean that information from available documentation.  I
>> am sure someone has done such test before and i appreciate if you can pass
>> on that information.  Secondly, I also would like to stream the logs to a
>> remote server.  Is that a log4j configuration or do i need to run an agent
>> on each host to do so?  Any configuration examples would be of great help.
>>
>> Thanks,
>> Bhaskar
>>
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
For testing purposes, I tried the following configuration without much
luck.  I see that the process starts fine, but it just does not write
anything to the sink.  I guess I am missing something here.  Can one of you
gurus take a look and suggest what I am doing wrong?

Thanks,
Bhaskar

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = svc_0_sink


agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -f /var/log/access.log
agent1.sources.tail.channels = MemoryChannel-2

agent1.sinks.svc_0_sink.type = FILE_ROLL
agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
agent1.sinks.svc_0_sink.rollInterval=0

agent1.channels.MemoryChannel-2.type = memory
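
For reference, the one line this configuration turns out to be missing (as
the follow-ups in this thread point out) is the binding of the sink to its
channel.  With the names used above, that is:

# tell the FILE_ROLL sink which channel to take events from
agent1.sinks.svc_0_sink.channel = MemoryChannel-2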


On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gp...@cyres.fr>wrote:

> Hi Bhaskar,
>
> This is the flume.conf (http://pastebin.com/WULgUuaf) what I'm using.
> I have an avro server on the hadoop-m host and one agent per node (slave
> hosts). Each agent send the ouput of a exec command to avro server.
>
> Host1 : exec -> memory -> avro (sink)
>
> Host2 : exec -> memory -> avro
>                                                >>>>>    MainHost : avro
> (source) -> memory -> rolling file (local FS)
> ...
>
> Host3 : exec -> memory -> avro
>
>
> Use your own exec command to read Apache log.
>
> Guillaume Polaert | Cyrès Conseil
>
> De : Bhaskar [mailto:bmarthi@gmail.com]
> Envoyé : mercredi 13 juin 2012 19:16
> À : flume-user@incubator.apache.org
> Objet : Newbee question about flume 1.2 set up
>
> Good Afternoon,
> I am a newbee to flume and read thru limited documentation available.  I
> would like to set up the following to test out.
>
> 1.  Read apache access logs (as source)
> 2.  Use memory channel
> 3.  Write it to a NFS (or even local) file system
>
> Can some one help me with the necessary configuration.  I am having
> difficult time to glean that information from available documentation.  I
> am sure someone has done such test before and i appreciate if you can pass
> on that information.  Secondly, I also would like to stream the logs to a
> remote server.  Is that a log4j configuration or do i need to run an agent
> on each host to do so?  Any configuration examples would be of great help.
>
> Thanks,
> Bhaskar
>

RE: Newbee question about flume 1.2 set up

Posted by Guillaume Polaert <gp...@cyres.fr>.
Hi Bhaskar,

This is the flume.conf (http://pastebin.com/WULgUuaf) that I'm using.
I have an avro server on the hadoop-m host and one agent per node (the slave hosts). Each agent sends the output of an exec command to the avro server.

Host1 : exec -> memory -> avro (sink)

Host2 : exec -> memory -> avro
						>>>>>    MainHost : avro (source) -> memory -> rolling file (local FS)					
...

Host3 : exec -> memory -> avro


Use your own exec command to read the Apache log.

Guillaume Polaert | Cyrès Conseil
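
The flume.conf behind the pastebin link is not reproduced in this thread, but
a minimal sketch of the slave-host side of the topology described above could
look like the following (the agent name and port are assumptions; the tail
command and the hadoop-m host name come from the thread itself):

host1.sources = tail
host1.channels = mem
host1.sinks = avroOut

# tail the Apache access log on this host
host1.sources.tail.type = exec
host1.sources.tail.command = tail -F /var/log/apache2/access.log
host1.sources.tail.channels = mem

# ship the events to the avro source running on hadoop-m
host1.sinks.avroOut.type = avro
host1.sinks.avroOut.hostname = hadoop-m
host1.sinks.avroOut.port = 4141
host1.sinks.avroOut.channel = mem

host1.channels.mem.type = memory

On hadoop-m, the matching agent would use an avro source bound to the same
port, a memory channel, and a FILE_ROLL sink like the ones shown elsewhere
in this thread.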

De : Bhaskar [mailto:bmarthi@gmail.com] 
Envoyé : mercredi 13 juin 2012 19:16
À : flume-user@incubator.apache.org
Objet : Newbee question about flume 1.2 set up

Good Afternoon,
I am a newbee to flume and read thru limited documentation available.  I would like to set up the following to test out.

1.  Read apache access logs (as source)
2.  Use memory channel
3.  Write it to a NFS (or even local) file system

Can some one help me with the necessary configuration.  I am having difficult time to glean that information from available documentation.  I am sure someone has done such test before and i appreciate if you can pass on that information.  Secondly, I also would like to stream the logs to a remote server.  Is that a log4j configuration or do i need to run an agent on each host to do so?  Any configuration examples would be of great help.

Thanks,
Bhaskar

Re: Newbee question about flume 1.2 set up

Posted by Hari Shreedharan <hs...@cloudera.com>.
Hi Bhaskar, 

Yes, it is possible to write to the local file system. Please take a look at the RollingFileSink. The sink type is FILE_ROLL. It requires the following configuration:

sink.directory = <directory where you want the data>
sink.rollInterval = <roll interval> -> set this to 0 if you don't want to roll the file.

Also, optionally: sink.serializer = <serializer class> -> by default it will write as text.

Hope this helps.

Thanks,
Hari

-- 
Hari Shreedharan
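
Written out with the full property keys on a hypothetical sink named
localSink (the agent, sink, and channel names here are placeholders), that
description amounts to something like:

agent1.sinks.localSink.type = FILE_ROLL
# directory the rolled files are written to
agent1.sinks.localSink.sink.directory = /flume_runtime/logs
# 0 = never roll; otherwise a new file is started every N seconds
agent1.sinks.localSink.sink.rollInterval = 0
# optional; the default TEXT serializer writes the event body as plain text
agent1.sinks.localSink.sink.serializer = TEXT
# and, as noted elsewhere in this thread, the sink must be bound to a channel
agent1.sinks.localSink.channel = MemoryChannel-2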


On Wednesday, June 13, 2012 at 12:38 PM, Bhaskar wrote:

> Thank you Mohammad for prompt response.  I built it from source and have tried few combinations so far.  I do not have HDFS set up yet.  I was trying to use local file system as sink.  Is that a feasible option?
> 
> Bhaskar
> 
> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <dontariq@gmail.com (mailto:dontariq@gmail.com)> wrote:
> > Hello Bhaskar,
> > 
> >         The very first step would be to build flume-ng from
> > trunk..you can use following commands to do that -
> > 
> > $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
> > $ cd flume
> > $ mvn3 install -DskipTests
> > 
> > I would suggest to use Maven 3.0.3, as you may find some problems with
> > Maven2..Once you are done with your build, you need to write the
> > configuration files for your agents..For example, to collect apache
> > web server logs into  the Hdfs, the basic configuration would be
> > something like this -
> > 
> > agent1.sources = tail
> > agent1.channels = MemoryChannel-2
> > agent1.sinks = HDFS
> > 
> > agent1.sources.tail.type = exec
> > agent1.sources.tail.command = tail -F /var/log/apache2/access.log
> > agent1.sources.tail.channels = MemoryChannel-2
> > 
> > agent1.sinks.HDFS.channel = MemoryChannel-2
> > agent1.sinks.HDFS.type = hdfs
> > agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
> > agent1.sinks.HDFS.hdfs.file.Type = DataStream
> > 
> > agent1.channels.MemoryChannel-2.type = memory
> > 
> > Save this file as agent1.conf inside your flume-ng/conf directory and
> > start your agent using -
> > 
> > $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
> > 
> > 
> > Regards,
> >     Mohammad Tariq
> > 
> > 
> > On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bmarthi@gmail.com (mailto:bmarthi@gmail.com)> wrote:
> > > Good Afternoon,
> > > I am a newbee to flume and read thru limited documentation available.  I
> > > would like to set up the following to test out.
> > >
> > > 1.  Read apache access logs (as source)
> > > 2.  Use memory channel
> > > 3.  Write it to a NFS (or even local) file system
> > >
> > > Can some one help me with the necessary configuration.  I am having
> > > difficult time to glean that information from available documentation.  I am
> > > sure someone has done such test before and i appreciate if you can pass on
> > > that information.  Secondly, I also would like to stream the logs to a
> > > remote server.  Is that a log4j configuration or do i need to run an agent
> > > on each host to do so?  Any configuration examples would be of great help.
> > >
> > > Thanks,
> > > Bhaskar
> 


Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Good luck :)
Regards,
    Mohammad Tariq


On Thu, Jun 14, 2012 at 2:07 AM, Bhaskar <bm...@gmail.com> wrote:
> Thank you both.  I am going to try it with both as POC.
>
> Bhaskar
>
>
> On Wed, Jun 13, 2012 at 4:28 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> As said by Ralph, Hdfs or Local fs are not the only options..you can
>> dump your data into any store like Cassandra, Hbase etc etc
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Thu, Jun 14, 2012 at 1:54 AM, Ralph Goers <ra...@dslextreme.com>
>> wrote:
>> > FWIW, Hadoop is not the only target repository. We are using Cassandra.
>> >
>> > Ralph
>> >
>> > On Jun 13, 2012, at 1:19 PM, Mohammad Tariq wrote:
>> >
>> >> it's absolutely ok for initial learning, but not feasible for
>> >> production or if you want to evaluate hadoop ecosystem properly..I
>> >> would suggest to setup a hadoop cluster in pseudo mode on your pc
>> >> first..or if you do not require hadoopat all then there is no problem.
>> >>
>> >> Regards,
>> >>     Mohammad Tariq
>> >>
>> >>
>> >> On Thu, Jun 14, 2012 at 1:08 AM, Bhaskar <bm...@gmail.com> wrote:
>> >>> Thank you Mohammad for prompt response.  I built it from source and
>> >>> have
>> >>> tried few combinations so far.  I do not have HDFS set up yet.  I was
>> >>> trying
>> >>> to use local file system as sink.  Is that a feasible option?
>> >>>
>> >>> Bhaskar
>> >>>
>> >>>
>> >>> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hello Bhaskar,
>> >>>>
>> >>>>         The very first step would be to build flume-ng from
>> >>>> trunk..you can use following commands to do that -
>> >>>>
>> >>>> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
>> >>>> $ cd flume
>> >>>> $ mvn3 install -DskipTests
>> >>>>
>> >>>> I would suggest to use Maven 3.0.3, as you may find some problems
>> >>>> with
>> >>>> Maven2..Once you are done with your build, you need to write the
>> >>>> configuration files for your agents..For example, to collect apache
>> >>>> web server logs into  the Hdfs, the basic configuration would be
>> >>>> something like this -
>> >>>>
>> >>>> agent1.sources = tail
>> >>>> agent1.channels = MemoryChannel-2
>> >>>> agent1.sinks = HDFS
>> >>>>
>> >>>> agent1.sources.tail.type = exec
>> >>>> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
>> >>>> agent1.sources.tail.channels = MemoryChannel-2
>> >>>>
>> >>>> agent1.sinks.HDFS.channel = MemoryChannel-2
>> >>>> agent1.sinks.HDFS.type = hdfs
>> >>>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
>> >>>> agent1.sinks.HDFS.hdfs.file.Type = DataStream
>> >>>>
>> >>>> agent1.channels.MemoryChannel-2.type = memory
>> >>>>
>> >>>> Save this file as agent1.conf inside your flume-ng/conf directory and
>> >>>> start your agent using -
>> >>>>
>> >>>> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>>     Mohammad Tariq
>> >>>>
>> >>>>
>> >>>> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
>> >>>>> Good Afternoon,
>> >>>>> I am a newbee to flume and read thru limited documentation
>> >>>>> available.  I
>> >>>>> would like to set up the following to test out.
>> >>>>>
>> >>>>> 1.  Read apache access logs (as source)
>> >>>>> 2.  Use memory channel
>> >>>>> 3.  Write it to a NFS (or even local) file system
>> >>>>>
>> >>>>> Can some one help me with the necessary configuration.  I am having
>> >>>>> difficult time to glean that information from available
>> >>>>> documentation.
>> >>>>>  I am
>> >>>>> sure someone has done such test before and i appreciate if you can
>> >>>>> pass
>> >>>>> on
>> >>>>> that information.  Secondly, I also would like to stream the logs to
>> >>>>> a
>> >>>>> remote server.  Is that a log4j configuration or do i need to run an
>> >>>>> agent
>> >>>>> on each host to do so?  Any configuration examples would be of great
>> >>>>> help.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Bhaskar
>> >>>
>> >>>
>> >
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Thank you both.  I am going to try both as a POC.

Bhaskar

On Wed, Jun 13, 2012 at 4:28 PM, Mohammad Tariq <do...@gmail.com> wrote:

> As said by Ralph, Hdfs or Local fs are not the only options..you can
> dump your data into any store like Cassandra, Hbase etc etc
>
> Regards,
>     Mohammad Tariq
>
>
> On Thu, Jun 14, 2012 at 1:54 AM, Ralph Goers <ra...@dslextreme.com>
> wrote:
> > FWIW, Hadoop is not the only target repository. We are using Cassandra.
> >
> > Ralph
> >
> > On Jun 13, 2012, at 1:19 PM, Mohammad Tariq wrote:
> >
> >> it's absolutely ok for initial learning, but not feasible for
> >> production or if you want to evaluate hadoop ecosystem properly..I
> >> would suggest to setup a hadoop cluster in pseudo mode on your pc
> >> first..or if you do not require hadoopat all then there is no problem.
> >>
> >> Regards,
> >>     Mohammad Tariq
> >>
> >>
> >> On Thu, Jun 14, 2012 at 1:08 AM, Bhaskar <bm...@gmail.com> wrote:
> >>> Thank you Mohammad for prompt response.  I built it from source and
> have
> >>> tried few combinations so far.  I do not have HDFS set up yet.  I was
> trying
> >>> to use local file system as sink.  Is that a feasible option?
> >>>
> >>> Bhaskar
> >>>
> >>>
> >>> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
> >>>>
> >>>> Hello Bhaskar,
> >>>>
> >>>>         The very first step would be to build flume-ng from
> >>>> trunk..you can use following commands to do that -
> >>>>
> >>>> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
> >>>> $ cd flume
> >>>> $ mvn3 install -DskipTests
> >>>>
> >>>> I would suggest to use Maven 3.0.3, as you may find some problems with
> >>>> Maven2..Once you are done with your build, you need to write the
> >>>> configuration files for your agents..For example, to collect apache
> >>>> web server logs into  the Hdfs, the basic configuration would be
> >>>> something like this -
> >>>>
> >>>> agent1.sources = tail
> >>>> agent1.channels = MemoryChannel-2
> >>>> agent1.sinks = HDFS
> >>>>
> >>>> agent1.sources.tail.type = exec
> >>>> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
> >>>> agent1.sources.tail.channels = MemoryChannel-2
> >>>>
> >>>> agent1.sinks.HDFS.channel = MemoryChannel-2
> >>>> agent1.sinks.HDFS.type = hdfs
> >>>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
> >>>> agent1.sinks.HDFS.hdfs.file.Type = DataStream
> >>>>
> >>>> agent1.channels.MemoryChannel-2.type = memory
> >>>>
> >>>> Save this file as agent1.conf inside your flume-ng/conf directory and
> >>>> start your agent using -
> >>>>
> >>>> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
> >>>>
> >>>>
> >>>> Regards,
> >>>>     Mohammad Tariq
> >>>>
> >>>>
> >>>> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
> >>>>> Good Afternoon,
> >>>>> I am a newbee to flume and read thru limited documentation
> available.  I
> >>>>> would like to set up the following to test out.
> >>>>>
> >>>>> 1.  Read apache access logs (as source)
> >>>>> 2.  Use memory channel
> >>>>> 3.  Write it to a NFS (or even local) file system
> >>>>>
> >>>>> Can some one help me with the necessary configuration.  I am having
> >>>>> difficult time to glean that information from available
> documentation.
> >>>>>  I am
> >>>>> sure someone has done such test before and i appreciate if you can
> pass
> >>>>> on
> >>>>> that information.  Secondly, I also would like to stream the logs to
> a
> >>>>> remote server.  Is that a log4j configuration or do i need to run an
> >>>>> agent
> >>>>> on each host to do so?  Any configuration examples would be of great
> >>>>> help.
> >>>>>
> >>>>> Thanks,
> >>>>> Bhaskar
> >>>
> >>>
> >
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
As Ralph said, HDFS and the local FS are not the only options; you can
dump your data into any store, such as Cassandra, HBase, etc.

Regards,
    Mohammad Tariq


On Thu, Jun 14, 2012 at 1:54 AM, Ralph Goers <ra...@dslextreme.com> wrote:
> FWIW, Hadoop is not the only target repository. We are using Cassandra.
>
> Ralph
>
> On Jun 13, 2012, at 1:19 PM, Mohammad Tariq wrote:
>
>> it's absolutely ok for initial learning, but not feasible for
>> production or if you want to evaluate hadoop ecosystem properly..I
>> would suggest to setup a hadoop cluster in pseudo mode on your pc
>> first..or if you do not require hadoopat all then there is no problem.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Thu, Jun 14, 2012 at 1:08 AM, Bhaskar <bm...@gmail.com> wrote:
>>> Thank you Mohammad for prompt response.  I built it from source and have
>>> tried few combinations so far.  I do not have HDFS set up yet.  I was trying
>>> to use local file system as sink.  Is that a feasible option?
>>>
>>> Bhaskar
>>>
>>>
>>> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>>>
>>>> Hello Bhaskar,
>>>>
>>>>         The very first step would be to build flume-ng from
>>>> trunk..you can use following commands to do that -
>>>>
>>>> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
>>>> $ cd flume
>>>> $ mvn3 install -DskipTests
>>>>
>>>> I would suggest to use Maven 3.0.3, as you may find some problems with
>>>> Maven2..Once you are done with your build, you need to write the
>>>> configuration files for your agents..For example, to collect apache
>>>> web server logs into  the Hdfs, the basic configuration would be
>>>> something like this -
>>>>
>>>> agent1.sources = tail
>>>> agent1.channels = MemoryChannel-2
>>>> agent1.sinks = HDFS
>>>>
>>>> agent1.sources.tail.type = exec
>>>> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
>>>> agent1.sources.tail.channels = MemoryChannel-2
>>>>
>>>> agent1.sinks.HDFS.channel = MemoryChannel-2
>>>> agent1.sinks.HDFS.type = hdfs
>>>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
>>>> agent1.sinks.HDFS.hdfs.file.Type = DataStream
>>>>
>>>> agent1.channels.MemoryChannel-2.type = memory
>>>>
>>>> Save this file as agent1.conf inside your flume-ng/conf directory and
>>>> start your agent using -
>>>>
>>>> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
>>>>
>>>>
>>>> Regards,
>>>>     Mohammad Tariq
>>>>
>>>>
>>>> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
>>>>> Good Afternoon,
>>>>> I am a newbee to flume and read thru limited documentation available.  I
>>>>> would like to set up the following to test out.
>>>>>
>>>>> 1.  Read apache access logs (as source)
>>>>> 2.  Use memory channel
>>>>> 3.  Write it to a NFS (or even local) file system
>>>>>
>>>>> Can some one help me with the necessary configuration.  I am having
>>>>> difficult time to glean that information from available documentation.
>>>>>  I am
>>>>> sure someone has done such test before and i appreciate if you can pass
>>>>> on
>>>>> that information.  Secondly, I also would like to stream the logs to a
>>>>> remote server.  Is that a log4j configuration or do i need to run an
>>>>> agent
>>>>> on each host to do so?  Any configuration examples would be of great
>>>>> help.
>>>>>
>>>>> Thanks,
>>>>> Bhaskar
>>>
>>>
>

Re: Newbee question about flume 1.2 set up

Posted by Ralph Goers <ra...@dslextreme.com>.
FWIW, Hadoop is not the only target repository. We are using Cassandra.

Ralph

On Jun 13, 2012, at 1:19 PM, Mohammad Tariq wrote:

> it's absolutely ok for initial learning, but not feasible for
> production or if you want to evaluate hadoop ecosystem properly..I
> would suggest to setup a hadoop cluster in pseudo mode on your pc
> first..or if you do not require hadoopat all then there is no problem.
> 
> Regards,
>     Mohammad Tariq
> 
> 
> On Thu, Jun 14, 2012 at 1:08 AM, Bhaskar <bm...@gmail.com> wrote:
>> Thank you Mohammad for prompt response.  I built it from source and have
>> tried few combinations so far.  I do not have HDFS set up yet.  I was trying
>> to use local file system as sink.  Is that a feasible option?
>> 
>> Bhaskar
>> 
>> 
>> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>> 
>>> Hello Bhaskar,
>>> 
>>>         The very first step would be to build flume-ng from
>>> trunk..you can use following commands to do that -
>>> 
>>> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
>>> $ cd flume
>>> $ mvn3 install -DskipTests
>>> 
>>> I would suggest to use Maven 3.0.3, as you may find some problems with
>>> Maven2..Once you are done with your build, you need to write the
>>> configuration files for your agents..For example, to collect apache
>>> web server logs into  the Hdfs, the basic configuration would be
>>> something like this -
>>> 
>>> agent1.sources = tail
>>> agent1.channels = MemoryChannel-2
>>> agent1.sinks = HDFS
>>> 
>>> agent1.sources.tail.type = exec
>>> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
>>> agent1.sources.tail.channels = MemoryChannel-2
>>> 
>>> agent1.sinks.HDFS.channel = MemoryChannel-2
>>> agent1.sinks.HDFS.type = hdfs
>>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
>>> agent1.sinks.HDFS.hdfs.file.Type = DataStream
>>> 
>>> agent1.channels.MemoryChannel-2.type = memory
>>> 
>>> Save this file as agent1.conf inside your flume-ng/conf directory and
>>> start your agent using -
>>> 
>>> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
>>> 
>>> 
>>> Regards,
>>>     Mohammad Tariq
>>> 
>>> 
>>> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
>>>> Good Afternoon,
>>>> I am a newbee to flume and read thru limited documentation available.  I
>>>> would like to set up the following to test out.
>>>> 
>>>> 1.  Read apache access logs (as source)
>>>> 2.  Use memory channel
>>>> 3.  Write it to a NFS (or even local) file system
>>>> 
>>>> Can some one help me with the necessary configuration.  I am having
>>>> difficult time to glean that information from available documentation.
>>>>  I am
>>>> sure someone has done such test before and i appreciate if you can pass
>>>> on
>>>> that information.  Secondly, I also would like to stream the logs to a
>>>> remote server.  Is that a log4j configuration or do i need to run an
>>>> agent
>>>> on each host to do so?  Any configuration examples would be of great
>>>> help.
>>>> 
>>>> Thanks,
>>>> Bhaskar
>> 
>> 


Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
It's absolutely OK for initial learning, but not feasible for
production or if you want to evaluate the Hadoop ecosystem properly. I
would suggest setting up a Hadoop cluster in pseudo-distributed mode on
your PC first; or, if you do not require Hadoop at all, then there is no
problem.

Regards,
    Mohammad Tariq


On Thu, Jun 14, 2012 at 1:08 AM, Bhaskar <bm...@gmail.com> wrote:
> Thank you Mohammad for prompt response.  I built it from source and have
> tried few combinations so far.  I do not have HDFS set up yet.  I was trying
> to use local file system as sink.  Is that a feasible option?
>
> Bhaskar
>
>
> On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> Hello Bhaskar,
>>
>>         The very first step would be to build flume-ng from
>> trunk..you can use following commands to do that -
>>
>> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
>> $ cd flume
>> $ mvn3 install -DskipTests
>>
>> I would suggest to use Maven 3.0.3, as you may find some problems with
>> Maven2..Once you are done with your build, you need to write the
>> configuration files for your agents..For example, to collect apache
>> web server logs into  the Hdfs, the basic configuration would be
>> something like this -
>>
>> agent1.sources = tail
>> agent1.channels = MemoryChannel-2
>> agent1.sinks = HDFS
>>
>> agent1.sources.tail.type = exec
>> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
>> agent1.sources.tail.channels = MemoryChannel-2
>>
>> agent1.sinks.HDFS.channel = MemoryChannel-2
>> agent1.sinks.HDFS.type = hdfs
>> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
>> agent1.sinks.HDFS.hdfs.file.Type = DataStream
>>
>> agent1.channels.MemoryChannel-2.type = memory
>>
>> Save this file as agent1.conf inside your flume-ng/conf directory and
>> start your agent using -
>>
>> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
>>
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
>> > Good Afternoon,
>> > I am a newbee to flume and read thru limited documentation available.  I
>> > would like to set up the following to test out.
>> >
>> > 1.  Read apache access logs (as source)
>> > 2.  Use memory channel
>> > 3.  Write it to a NFS (or even local) file system
>> >
>> > Can some one help me with the necessary configuration.  I am having
>> > difficult time to glean that information from available documentation.
>> >  I am
>> > sure someone has done such test before and i appreciate if you can pass
>> > on
>> > that information.  Secondly, I also would like to stream the logs to a
>> > remote server.  Is that a log4j configuration or do i need to run an
>> > agent
>> > on each host to do so?  Any configuration examples would be of great
>> > help.
>> >
>> > Thanks,
>> > Bhaskar
>
>

Re: Newbee question about flume 1.2 set up

Posted by Bhaskar <bm...@gmail.com>.
Thank you, Mohammad, for the prompt response.  I built it from source and
have tried a few combinations so far.  I do not have HDFS set up yet.  I
was trying to use the local file system as the sink.  Is that a feasible
option?

Bhaskar

On Wed, Jun 13, 2012 at 1:45 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Bhaskar,
>
>         The very first step would be to build flume-ng from
> trunk..you can use following commands to do that -
>
> $ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
> $ cd flume
> $ mvn3 install -DskipTests
>
> I would suggest to use Maven 3.0.3, as you may find some problems with
> Maven2..Once you are done with your build, you need to write the
> configuration files for your agents..For example, to collect apache
> web server logs into  the Hdfs, the basic configuration would be
> something like this -
>
> agent1.sources = tail
> agent1.channels = MemoryChannel-2
> agent1.sinks = HDFS
>
> agent1.sources.tail.type = exec
> agent1.sources.tail.command = tail -F /var/log/apache2/access.log
> agent1.sources.tail.channels = MemoryChannel-2
>
> agent1.sinks.HDFS.channel = MemoryChannel-2
> agent1.sinks.HDFS.type = hdfs
> agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
> agent1.sinks.HDFS.hdfs.file.Type = DataStream
>
> agent1.channels.MemoryChannel-2.type = memory
>
> Save this file as agent1.conf inside your flume-ng/conf directory and
> start your agent using -
>
> $ bin/flume-ng agent -n agent1 -f conf/agent1.conf
>
>
> Regards,
>     Mohammad Tariq
>
>
> On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
> > Good Afternoon,
> > I am a newbee to flume and read thru limited documentation available.  I
> > would like to set up the following to test out.
> >
> > 1.  Read apache access logs (as source)
> > 2.  Use memory channel
> > 3.  Write it to a NFS (or even local) file system
> >
> > Can some one help me with the necessary configuration.  I am having
> > difficult time to glean that information from available documentation.
>  I am
> > sure someone has done such test before and i appreciate if you can pass
> on
> > that information.  Secondly, I also would like to stream the logs to a
> > remote server.  Is that a log4j configuration or do i need to run an
> agent
> > on each host to do so?  Any configuration examples would be of great
> help.
> >
> > Thanks,
> > Bhaskar
>

Re: Newbee question about flume 1.2 set up

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Bhaskar,

         The very first step would be to build flume-ng from
trunk. You can use the following commands to do that -

$ svn co https://svn.apache.org/repos/asf/incubator/flume/trunk flume
$ cd flume
$ mvn3 install -DskipTests

I would suggest using Maven 3.0.3, as you may find some problems with
Maven 2. Once you are done with your build, you need to write the
configuration files for your agents. For example, to collect Apache
web server logs into HDFS, the basic configuration would be
something like this -

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = HDFS

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/apache2/access.log
agent1.sources.tail.channels = MemoryChannel-2

agent1.sinks.HDFS.channel = MemoryChannel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream

agent1.channels.MemoryChannel-2.type = memory

Save this file as agent1.conf inside your flume-ng/conf directory and
start your agent using -

$ bin/flume-ng agent -n agent1 -f conf/agent1.conf


Regards,
    Mohammad Tariq
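
A commonly used variant of that launch command also passes the configuration
directory and turns on console logging, which is handy while testing (the
paths below are illustrative):

$ bin/flume-ng agent --conf conf --conf-file conf/agent1.conf --name agent1 -Dflume.root.logger=INFO,console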


On Wed, Jun 13, 2012 at 10:45 PM, Bhaskar <bm...@gmail.com> wrote:
> Good Afternoon,
> I am a newbee to flume and read thru limited documentation available.  I
> would like to set up the following to test out.
>
> 1.  Read apache access logs (as source)
> 2.  Use memory channel
> 3.  Write it to a NFS (or even local) file system
>
> Can some one help me with the necessary configuration.  I am having
> difficult time to glean that information from available documentation.  I am
> sure someone has done such test before and i appreciate if you can pass on
> that information.  Secondly, I also would like to stream the logs to a
> remote server.  Is that a log4j configuration or do i need to run an agent
> on each host to do so?  Any configuration examples would be of great help.
>
> Thanks,
> Bhaskar