
Posted to user@flume.apache.org by Sharninder <sh...@gmail.com> on 2014/07/01 06:26:42 UTC

Re: Flume NG and S3

No reason to not use flume except for the fact that S3, since it's over the
wire, will be a lot slower than a local hdfs cluster, in which case you need
a big enough channel to hold events not yet processed out of the sink. If
you have a fast enough pipe, you can very well use flume for this sort of
use-case.
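
A rough way to put numbers on "a big enough channel" is to look at how much
data piles up while the sink drains slower than the source fills. The sketch
below is only an illustration: the function name and the rates are made-up
assumptions, not measurements from anyone's setup.

```python
MB = 1_000_000  # decimal megabyte, in bytes


def backlog_bytes(ingest_bps: float, drain_bps: float, seconds: float) -> float:
    """Bytes left waiting in the channel after `seconds` at the given rates."""
    # The channel only grows when the source outpaces the sink.
    return max(0.0, float(ingest_bps - drain_bps) * seconds)


if __name__ == "__main__":
    # Hypothetical rates: source at 4 MB/s, S3-backed sink draining at 3 MB/s.
    slow = backlog_bytes(4 * MB, 3 * MB, 3600)
    # Hypothetical 10-minute stall during which the sink drains nothing.
    stall = backlog_bytes(4 * MB, 0, 600)
    print(f"1 h of slow sink: {slow / MB:.0f} MB queued")
    print(f"10 min stall:     {stall / MB:.0f} MB queued")
```

Whatever the real rates turn out to be, the channel has to hold at least the
worst backlog you expect, plus some headroom.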

The reason the author might have moved to kafka, and I'm just speculating
here, is that kafka provides him better buffering support for exactly the
case I've written above.

HTH
Sharninder

On Mon, Jun 30, 2014 at 7:57 PM, Máté Gulyás <gu...@dmlab.hu> wrote:

> Hi!
>
> I would like to use flume to aggregate and send logs to an S3 bucket.
> I did some research, but the last article I found on the topic was
> more than a year old and the author abandoned Flume for Kafka. My
> other concern is that most of the articles were written for Flume OG,
> not NG.
> Is there any reason why I should not use flume to sink messages to S3?
>
>
> Thanks in advance.
>
> Mate Gulyas
> Lead Developer at Dmlab
>

Re: Flume NG and S3

Posted by Máté Gulyás <gu...@dmlab.hu>.

We run flume on the EC2 instances and sink the aggregates to S3, but
we have to do that (S3) due to cost constraints.

Mate

On Tue, Jul 1, 2014 at 10:44 AM, Asim Zafir <as...@gmail.com> wrote:
> You will have to see how much of a performance compromise the Flume sink
> to S3 is going to be. I would highly recommend using Flume 1.5.0+ because
> of the many bug fixes and optimizations it comes with. Moreover, if you can
> afford it, instead of going the S3 route, fire up some EBS volumes on EC2,
> set up an HDFS cluster, and sink the files there. That would be much better
> than going the S3 route.

Re: Flume NG and S3

Posted by Asim Zafir <as...@gmail.com>.

You will have to see how much of a performance compromise the Flume sink
to S3 is going to be. I would highly recommend using Flume 1.5.0+ because
of the many bug fixes and optimizations it comes with. Moreover, if you can
afford it, instead of going the S3 route, fire up some EBS volumes on EC2,
set up an HDFS cluster, and sink the files there. That would be much better
than going the S3 route.


On Tue, Jul 1, 2014 at 1:42 AM, Máté Gulyás <gu...@dmlab.hu> wrote:

> Our current stack has Flume with a socket source and HDFS sink. We are
> moving to AWS and keeping Flume would be a great time saver. Kinesis looks
> good, but if I can use Flume I would stick with it. Due to the S3 PUT
> price, we have to aggregate, and Flume does that with the file channel.
>
> Mate Gulyas

Re: Flume NG and S3

Posted by Máté Gulyás <gu...@dmlab.hu>.

Our current stack has Flume with a socket source and HDFS sink. We are
moving to AWS and keeping Flume would be a great time saver. Kinesis looks
good, but if I can use Flume I would stick with it. Due to the S3 PUT
price, we have to aggregate, and Flume does that with the file channel.
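
To show why aggregation matters for the PUT bill, here is a small sketch that
counts uploads per month for a given flush interval. The function names are
made up for illustration, and the price passed in is a hypothetical
placeholder, not a quoted AWS rate; check the current S3 pricing before
relying on it.

```python
def puts_per_month(flush_interval_s: int, nodes: int = 1, days: int = 30) -> int:
    """Number of objects written if every node uploads once per interval."""
    return (days * 24 * 3600 // flush_interval_s) * nodes


def request_cost(puts: int, price_per_1000: float) -> float:
    """Request charge for `puts` uploads at a caller-supplied price per 1000."""
    return puts / 1000 * price_per_1000


if __name__ == "__main__":
    HYPOTHETICAL_PRICE = 0.005  # placeholder price per 1000 PUTs, not a real quote
    for label, interval in (("hourly flush", 3600), ("flush every 10 s", 10)):
        n = puts_per_month(interval)
        print(f"{label}: {n} PUTs/node/month, cost {request_cost(n, HYPOTHETICAL_PRICE):.4f}")
```

With an hourly flush each node writes 720 objects in a 30-day month, versus
259200 when flushing every 10 seconds.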

Mate Gulyas

On Tue, Jul 1, 2014 at 10:05 AM, Nitin Pawar <ni...@gmail.com> wrote:
> If you are heavily dependent on the AWS stack, then instead of Kafka you
> can look at AWS Kinesis, and from there on there is good integration
> available to AWS S3 or any other service you want to dump data into.

Re: Flume NG and S3

Posted by Nitin Pawar <ni...@gmail.com>.

If you are heavily dependent on the AWS stack, then instead of Kafka you can
look at AWS Kinesis, and from there on there is good integration
available to AWS S3 or any other service you want to dump data into.




On Tue, Jul 1, 2014 at 1:33 PM, Asim Zafir <as...@gmail.com> wrote:

> Kafka's framework is designed more for scalable read I/O than for a massive
> write event push going to centralized storage such as HDFS.
>
> Not sure how Flume's avro sink to S3 would turn out for the entire Flume
> pipeline. I suspect it will be fatal to carry on with a memory channel, and
> even if you have a file channel on the Flume agents/collectors, it is very
> likely to cause buffering on the channel.


-- 
Nitin Pawar

Re: Flume NG and S3

Posted by Asim Zafir <as...@gmail.com>.

Kafka's framework is designed more for scalable read I/O than for a massive
write event push going to centralized storage such as HDFS.

Not sure how Flume's avro sink to S3 would turn out for the entire Flume
pipeline. I suspect it will be fatal to carry on with a memory channel, and
even if you have a file channel on the Flume agents/collectors, it is very
likely to cause buffering on the channel.




On Mon, Jun 30, 2014 at 11:47 PM, Máté Gulyás <gu...@dmlab.hu> wrote:

> Please see my comments inline.
>
> YIMEN YIMGA Gael wrote:
> > Could you please share the link to the article you read?
> https://gist.github.com/crowdmatt/5256881 and the last comment.
>
> Sharninder wrote:
> > No reason to not use flume except for the fact that S3, since it's over
> > the wire, will be a lot slower than a local hdfs cluster, in which case
> > you need a big enough channel to hold events not yet processed out of
> > the sink. If you have a fast enough pipe, you can very well use flume
> > for this sort of use-case.
> I plan to aggregate 5-15GB of data with Filechannel, as I want to flush
> to S3 every hour on every node. As far as I know Flume can gzip it, so
> the size would be about 500MB-1.5GB.
>
> Thanks for the feedback, I will write if I have any results.
>
> Mate Gulyas

Re: Flume NG and S3

Posted by Máté Gulyás <gu...@dmlab.hu>.

Please see my comments inline.

YIMEN YIMGA Gael wrote:
> Could you please share the link to the article you read?
https://gist.github.com/crowdmatt/5256881 and the last comment.

Sharninder wrote:
> No reason to not use flume except for the fact that S3, since it's over
> the wire, will be a lot slower than a local hdfs cluster, in which case
> you need a big enough channel to hold events not yet processed out of
> the sink. If you have a fast enough pipe, you can very well use flume
> for this sort of use-case.
I plan to aggregate 5-15GB of data with Filechannel, as I want to flush
to S3 every hour on every node. As far as I know Flume can gzip it, so
the size would be about 500MB-1.5GB.
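
As a back-of-the-envelope check on those figures, the sketch below converts
the hourly volumes into sustained rates and estimates how long one hourly
upload would take. The helper names and the 10 MB/s link speed are
assumptions for illustration only; the hourly sizes are the ones given above.

```python
GB = 1000 ** 3  # decimal gigabyte, in bytes
MB = 1000 ** 2  # decimal megabyte, in bytes


def sustained_mb_per_s(bytes_per_hour: float) -> float:
    """Average rate in MB/s needed to move `bytes_per_hour` within the hour."""
    return bytes_per_hour / 3600 / MB


def upload_seconds(size_bytes: float, link_mb_per_s: float) -> float:
    """Time to push one file of `size_bytes` over a link of the given speed."""
    return size_bytes / (link_mb_per_s * MB)


if __name__ == "__main__":
    for raw_gb, gz_gb in ((5, 0.5), (15, 1.5)):
        print(
            f"{raw_gb} GB/h raw = {sustained_mb_per_s(raw_gb * GB):.2f} MB/s into the channel; "
            f"{gz_gb} GB gzip takes {upload_seconds(gz_gb * GB, 10):.0f} s at an assumed 10 MB/s"
        )
```

The channel sees the uncompressed rate, while the link to S3 only has to
carry the gzipped file once per hour.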

Thanks for the feedback, I will write if I have any results.

Mate Gulyas
