Posted to user@flume.apache.org by 周梦想 <ab...@gmail.com> on 2013/01/29 08:24:40 UTC

flume tail source problem and performance

hello,
1. I want to tail a log source and write it to HDFS. Below is the configuration:
config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [co1, collectorSource( 35853 ),  [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]


I found that if I restart the agent node, it resends the whole content of game.log
to the collector. Is there a way to resume sending from the point it had already
reached? Or do I have to keep a marker myself, or remove the logs manually,
whenever I restart the agent node?

2. I also tested the performance of Flume and found it a bit slow.
With the configuration above I get only about 50 MB/minute.
I changed the configuration to the following:
ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
agentDFOSink("hadoop48",35853);

config [co1, collectorSource( 35853 ), [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]

I sent a 300 MB log and it took about 3 minutes, so roughly 100 MB/minute
(under 2 MB/second), while sending the same log from ag1 to co1 via scp runs
at about 30 MB/second.

Does anyone have any ideas?

thanks!

Andy

Re: flume tail source problem and performance

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Hi Andy,

I meant more your own program / script to parse the data (instead of tail -*), so that you have some control over the contents. Note that when a Flume agent is restarted, the marker for tail is lost as well. That comes from tail itself; Flume has no control over it.
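
Purely as an illustration of such a script, here is a minimal checkpointing tailer (a sketch, not Flume code: the .offset sidecar file and the polling loop are assumptions, and log rotation is only handled as truncation):

import os
import sys
import time

LOG = "/home/zhouhh/game.log"   # the file being followed in this thread
MARK = LOG + ".offset"          # hypothetical sidecar file holding the byte offset

def read_mark():
    # Resume point from the last run; 0 if we have never run before.
    try:
        with open(MARK) as f:
            return int(f.read().strip() or 0)
    except IOError:
        return 0

def write_mark(pos):
    # Persist progress so a restart resumes here instead of resending.
    with open(MARK, "w") as f:
        f.write(str(pos))

pos = min(read_mark(), os.path.getsize(LOG))  # if the log shrank, assume rotation and start over
with open(LOG, "rb") as f:
    f.seek(pos)
    while True:
        line = f.readline()
        if not line:                # at EOF: checkpoint, then poll for new data
            write_mark(f.tell())
            time.sleep(1)
            continue
        sys.stdout.write(line.decode("utf-8", "replace"))  # forward the event (stdout here)

The lines written to stdout would still have to be handed to Flume (or straight to your collector); the point is only that the offset survives an agent restart, which plain tail cannot guarantee.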

- Alex


--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF


Re: flume tail source problem and performance

Posted by 周梦想 <ab...@gmail.com>.
Hi Alex,

You mean I should write a script to check these directories?
[zhouhh@Hadoop46 ag1]$ pwd
/tmp/flume-zhouhh/agent/ag1
[zhouhh@Hadoop46 ag1]$ ls
dfo_error  dfo_import  dfo_logged  dfo_sending  dfo_writing  done  error
 import  logged  sending  sent  writing

How should I check them so that data is neither lost nor resent? Clean out the sending dir?

thanks!
Andy


Re: flume tail source problem and performance

Posted by 周梦想 <ab...@gmail.com>.
Thanks to JS and GuoWei, I'll try this.

Best Regards,
Andy Zhou


Re: flume tail source problem and performance

Posted by GuoWei <we...@wbkit.com>.
Exactly. It's very easy to write a custom source and sink in Flume. You can just follow one of the default sources or sinks as an example, export your code as a jar file into Flume's lib folder, and then configure your own source or sink in the Flume conf file.


Best Regards
Guo Wei




Re: flume tail source problem and performance

Posted by Jeong-shik Jang <js...@gruter.com>.
Yes, you can; the Flume plugin framework provides an easy way to implement and
apply your own sources, decorators, and sinks.

-JS



-- 
Jeong-shik Jang / jsjang@gruter.com
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting


Re: flume tail source problem and performance

Posted by 周梦想 <ab...@gmail.com>.
Hi JS,

Thank you for your reply. So collecting logs this way is a real shortcoming of
Flume. Can I write my own agent that sends logs via the thrift protocol
directly to the collector server?

Best Regards,
Andy Zhou



Re: flume tail source problem and performance

Posted by Jeong-shik Jang <js...@gruter.com>.
Hi Andy,

1. "startFromEnd=true" in your source configuration means data missing 
can happen at restart in tail side because flume will ignore any data 
event generated during restart and start at the end all the time.
2. With agentSink, data duplication can happen due to ack delay from 
master or at agent restart.

I think it is why Flume-NG doesn't support tail any more but does let 
user handle using script or program; tailing is a tricky job.

My suggestion is to use agentBEChain in the agent tier and DFO in the collector
tier; you can still lose some data during failover when a failure occurs.
To minimize loss and duplication, implementing a checkpoint function in tail
can also help, as in the script sketched earlier in this thread.
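
For example, an agent line in the same syntax as the configs above might read
(the second collector, hadoop49, is hypothetical; agentBEChain takes a list of
host:port collectors to fail over between):

config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true), agentBEChain("hadoop48:35853","hadoop49:35853")]

Being best-effort, the chain keeps no write-ahead log on the agent, which
avoids the resend-on-restart you saw, at the cost of losing events while every
collector in the chain is unreachable.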

Having a monitoring system that detects failures is very important as well, so
that you notice a failure and can react quickly.

-JS


-- 
Jeong-shik Jang / jsjang@gruter.com
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting


Re: flume tail source problem and performance

Posted by 周梦想 <ab...@gmail.com>.
Hi JS,
We can't accept agentBESink. These logs are important for data analysis, so we
can't afford any errors in the data; neither loss nor duplication is
acceptable. One agent's configuration is:

tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)

Every time this Windows agent restarts, it resends all the data to the
collector server. And if for some reason we restart the agent node, we can't
recover the mark of how far into the log the agent had already sent.



Re: flume tail source problem and performance

Posted by Jeong-shik Jang <js...@gruter.com>.
Hi Andy,

Since you set the startFromEnd option to true, the resend is most likely caused
by the DFO mechanism (agentDFOSink): when you restart a Flume node in DFO mode,
all events in intermediate stages (logged, writing, sending, and so on) roll
back to the logged stage, which means resending and duplication.

For better performance, you may want to use agentBESink instead of
agentDFOSink. If you have multiple collectors, I recommend agentBEChain for
failover in case a collector fails.

-JS



-- 
Jeong-shik Jang / jsjang@gruter.com
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting


Re: flume tail source problem and performance

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
Hi,

You could use tail -F, but that depends on the external source; Flume has no control over it. You can also write your own script and plug that in.

What's the content of the /tmp/flume/agent/agent*.*/ directories? Are sent and sending clean?
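
A quick way to eyeball that, sketched in Python (the base path is the one Andy
shows earlier in this thread; adjust the agent name and flume.agent.logdir for
your setup):

import os

BASE = "/tmp/flume-zhouhh/agent/ag1"  # from Andy's listing above; yours may differ

# Count leftover event files per DFO stage; non-empty sending/logged
# directories mean there is still unacknowledged data waiting to be replayed.
for stage in ("writing", "logged", "sending", "sent",
              "dfo_writing", "dfo_logged", "dfo_sending", "dfo_error"):
    path = os.path.join(BASE, stage)
    files = os.listdir(path) if os.path.isdir(path) else []
    print("%-12s %3d file(s)" % (stage, len(files)))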

- Alex


--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF