Posted to hdfs-user@hadoop.apache.org by Public Network Services <pu...@gmail.com> on 2013/08/05 19:58:57 UTC

Large-scale collection of logs from multiple Hadoop nodes

Hi...

I am facing a large-scale log collection scenario for a Hadoop cluster and
am examining how best to implement it.

More specifically, imagine a cluster of hundreds of nodes, each of which
constantly produces Syslog events that need to be gathered and analyzed at
another point. The total volume of logs could be tens of gigabytes per day,
if not more, and the reception rate on the order of thousands of events per
second, if not more.

One solution is to send those events over the network (e.g., using Flume)
and collect them in one or more (fewer than 5) nodes in the cluster, or in
another location, where the logs would be processed either by a continuously
running MapReduce job, or by non-Hadoop servers running some log processing
application.
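
For concreteness, the per-node agent in that Flume setup might look
something like the following (agent, host, and port names are made up; this
is a sketch, not a tested config):

    # One agent per node: listen for local syslog, forward to a collector tier.
    agent1.sources = sys
    agent1.channels = ch
    agent1.sinks = fwd

    # Built-in syslog-over-TCP source (a syslogudp type also exists)
    agent1.sources.sys.type = syslogtcp
    agent1.sources.sys.host = 0.0.0.0
    agent1.sources.sys.port = 5140
    agent1.sources.sys.channels = ch

    agent1.channels.ch.type = memory
    agent1.channels.ch.capacity = 100000

    # Avro sink forwards to one of the few collector nodes
    agent1.sinks.fwd.type = avro
    agent1.sinks.fwd.hostname = collector1.example.com
    agent1.sinks.fwd.port = 4141
    agent1.sinks.fwd.channel = ch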

Another approach could be to push all these events into a queuing system
such as ActiveMQ or RabbitMQ.

In all cases, the main objective is to be able to do real-time log analysis.

What would be the best way of implementing the above scenario?

Thanks!

PNS

Re: Large-scale collection of logs from multiple Hadoop nodes

Posted by Harsh J <ha...@cloudera.com>.
Andrei's Flume interceptor mention reminds me of James Kinley's streaming
top-N example in his flume-interceptor-analytics GitHub repo at
https://github.com/jrkinley/flume-interceptor-analytics#the-streaming-topn-example
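
For anyone wanting to roll their own, the interceptor surface is small. A
minimal sketch against the Flume 1.x API (the severity-tagging logic here is
just an illustration, and the syslog sources can already extract similar
headers themselves):

    import java.util.List;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    public class SeverityTaggingInterceptor implements Interceptor {
        @Override public void initialize() { }

        // Parse the RFC 3164 "<PRI>" prefix and expose the severity as a
        // header, so channel selectors or sinks can route events in flight.
        @Override public Event intercept(Event event) {
            String body = new String(event.getBody());
            int end = body.indexOf('>');
            if (body.startsWith("<") && end > 1) {
                try {
                    int pri = Integer.parseInt(body.substring(1, end));
                    event.getHeaders().put("severity", String.valueOf(pri % 8));
                } catch (NumberFormatException ignored) { }
            }
            return event;
        }

        @Override public List<Event> intercept(List<Event> events) {
            for (Event e : events) intercept(e);
            return events;
        }

        @Override public void close() { }

        public static class Builder implements Interceptor.Builder {
            @Override public Interceptor build() {
                return new SeverityTaggingInterceptor();
            }
            @Override public void configure(Context context) { }
        }
    }

It would be wired into a source with something like (class name hypothetical):

    agent1.sources.sys.interceptors = sev
    agent1.sources.sys.interceptors.sev.type = com.example.SeverityTaggingInterceptor$Builder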


-- 
Harsh J

Re: Large-scale collection of logs from multiple Hadoop nodes

Posted by Andrei <fa...@gmail.com>.
We have similar requirements and built our log collection system around
RSyslog and Flume. It is not in production yet, but tests so far look pretty
good. We rejected the idea of using AMQP, since it introduces significant
overhead for log events.

You could probably use Flume interceptors to do real-time processing on your
events, though I haven't tried anything like that before. Alternatively, you
could use Twitter Storm to handle your logs. Either way, I wouldn't recommend
Hadoop MapReduce for real-time processing of logs, and there's at least one
important reason for this.

As you probably know, a Flume source obtains new events and puts them into a
channel, from which a sink then pulls them. The HDFS sink rolls its output
files at a configurable interval (normally time-based, though you can also
roll by file size or event count). If this interval is large, you won't get
real-time processing. And if it is small, Flume will produce a huge number of
small files in HDFS, say of 10-100KB each. HDFS copes badly with that: each
file occupies its own block(s), the NameNode has to track every file and
block in memory, and every small file is still replicated, so millions of
tiny log files quickly become a serious scalability problem.
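
To make that concrete, these are the roll settings on the HDFS sink I mean
(sink name and path are illustrative; whichever limit is hit first closes
the current file):

    agent1.sinks.hdfs1.type = hdfs
    # Date escapes need a timestamp header, e.g. from the timestamp interceptor
    agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/logs/%Y-%m-%d
    # Roll a new file every 5 minutes (0 disables time-based rolling)
    agent1.sinks.hdfs1.hdfs.rollInterval = 300
    # ...or once the file reaches ~128 MB (0 disables size-based rolling)
    agent1.sinks.hdfs1.hdfs.rollSize = 134217728
    # Event-count-based rolling disabled
    agent1.sinks.hdfs1.hdfs.rollCount = 0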

Of course, you can use an ad-hoc workaround such as deleting small files from
time to time or compacting them into larger ones, but monitoring such a
system becomes much harder and may lead to unexpected results. So processing
log events before they get to HDFS seems to be the better idea.


Re: Large-scale collection of logs from multiple Hadoop nodes

Posted by Inder Pall <in...@gmail.com>.
We have been using a Flume-like system for such use cases at significantly
large scale, and it has been working quite well.

Would like to hear thoughts/challenges around using ZeroMQ-like systems at
comparable scale.

inder
"you are the average of 5 people you spend the most time with"

Re: Large-scale collection of logs from multiple Hadoop nodes

Posted by Alexander Lorenz <wg...@gmail.com>.
Hi,

The approach with Flume is the most reliable workflow for this, since Flume has a built-in Syslog source as well as a load-balancing sink processor. On top of that, you can define multiple channels for different sources.
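
For example, load balancing across two downstream collectors is a sink-group
setting; a minimal sketch (agent and sink names are made up, and the sinks
k1/k2 are assumed to be defined elsewhere in the config):

    agent1.sinkgroups = g1
    agent1.sinkgroups.g1.sinks = k1 k2
    agent1.sinkgroups.g1.processor.type = load_balance
    agent1.sinkgroups.g1.processor.selector = round_robin
    # Temporarily skip a collector that is failing
    agent1.sinkgroups.g1.processor.backoff = true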

Best,
Alex

sent via my mobile device

mapredit.blogspot.com
@mapredit



Re: Large-scale collection of logs from multiple Hadoop nodes

Posted by 武泽胜 <wu...@xiaomi.com>.
We have the same scenario as you described. The following is our solution, just FYI:

We installed a local Scribe agent on every node of our cluster, and we have several central Scribe servers. We extended log4j to support writing logs to the local Scribe agent; the local agents forward the logs to the central Scribe servers, and finally the central servers write these logs to a dedicated HDFS cluster used for offline processing.
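
The log4j side of that is just a custom appender. A rough sketch of the
shape such an extension takes (the raw TCP transport here is deliberately
simplified; a real Scribe agent speaks Thrift, so treat this only as an
illustration of the log4j plumbing):

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;
    import org.apache.log4j.AppenderSkeleton;
    import org.apache.log4j.spi.LoggingEvent;

    public class AgentForwardingAppender extends AppenderSkeleton {
        private Socket socket;
        private Writer out;

        @Override public void activateOptions() {
            try {
                // 1463 is Scribe's conventional port; transport is simplified
                socket = new Socket("127.0.0.1", 1463);
                out = new OutputStreamWriter(socket.getOutputStream(), "UTF-8");
            } catch (IOException e) {
                errorHandler.error("cannot connect to local agent", e, 0);
            }
        }

        @Override protected void append(LoggingEvent event) {
            if (out == null) return;
            try {
                // Forward the formatted line; the co-located agent handles
                // buffering and forwarding to the central servers.
                out.write(layout != null ? layout.format(event)
                                         : event.getRenderedMessage() + "\n");
                out.flush();
            } catch (IOException e) {
                errorHandler.error("forward failed", e, 0);
            }
        }

        @Override public void close() {
            try { if (socket != null) socket.close(); }
            catch (IOException ignored) { }
        }

        @Override public boolean requiresLayout() { return false; }
    }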

Then we use Hive/Impala to analyze the collected logs.
