Posted to common-user@hadoop.apache.org by Karl Hennig <kh...@baynote.com> on 2012/04/21 00:14:17 UTC

Feedback on real world production experience with Flume

I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically.

I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully.  I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases.

The other option would be to write our own tools.  What methods are you using for these kinds of tasks?  Did you write your own or does Flume (or something else) work for you?

I'm also on the Flume mailing list, but I wanted to ask these questions here because I'm interested in Flume _and_ alternatives.

Thank you!

Re: Feedback on real world production experience with Flume

Posted by Chen He <ai...@gmail.com>.

Can NFS become the bottleneck?

Chen

On Sat, Apr 21, 2012 at 5:23 PM, Edward Capriolo <ed...@gmail.com> wrote:

> It seems pretty relevant. If you can directly log via NFS that is a
> viable alternative.

Re: Feedback on real world production experience with Flume

Posted by Bill Graham <bi...@gmail.com>.

+1 on Edward's comment.

The MapR comment was relevant and informative and the original poster never
said he was only interested in open source options.

On Sunday, April 22, 2012, Michael Segel wrote:

> Gee Edward, what about putting a link to a company website or your blog in
> your signature... ;-)
>
> Seriously one could also mention fuse, right?  ;-)
>
>
> Sent from my iPhone



-- 
*Note that I'm no longer using my Yahoo! email address. Please email me at
billgraham@gmail.com going forward.*

Re: Feedback on real world production experience with Flume

Posted by Michael Segel <mi...@hotmail.com>.

Gee Edward, what about putting a link to a company website or your blog in your signature... ;-)

Seriously one could also mention fuse, right?  ;-)


Sent from my iPhone

On Apr 22, 2012, at 7:15 AM, "Edward Capriolo" <ed...@gmail.com> wrote:

> I think this is valid to talk about. For example, one need not use a
> decentralized collector if one can just write logs directly to
> decentralized files in a decentralized file system. In any case, it was
> not even a hard vendor pitch. It was someone describing how they
> handle centralized logging. It stated facts and it was informative.
> 
> Let's face it: if FUSE-mounting HDFS or directly soft-mounting NFS
> performed well, many of the use cases for Flume- and Scribe-like
> tools would be gone (not all, but many).
> 
> I never knew there was a rule against discussing alternative software on
> a mailing list. It seems like a closed-minded thing. I also doubt the
> ASF would back a rule like that. Are we not allowed to talk about EMR
> or S3, or am I not even allowed to mention S3?
> 
> Can Flume run on EC2 and log to S3? (Oops, party foul; I guess I can't ask that.)
> 
> Edward

Re: Feedback on real world production experience with Flume

Posted by Edward Capriolo <ed...@gmail.com>.

I think this is valid to talk about. For example, one need not use a
decentralized collector if one can just write logs directly to
decentralized files in a decentralized file system. In any case, it was
not even a hard vendor pitch. It was someone describing how they
handle centralized logging. It stated facts and it was informative.

Let's face it: if FUSE-mounting HDFS or directly soft-mounting NFS
performed well, many of the use cases for Flume- and Scribe-like
tools would be gone (not all, but many).
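
A minimal sketch of the "write logs directly to a mounted file system" idea, assuming the cluster file system is already mounted (via NFS or FUSE) at some local directory. The function name `make_mounted_logger`, the mount directory, and the host name are hypothetical and not from this thread; only Python's standard `logging` module is used.

```python
import logging
import os


def make_mounted_logger(mount_dir: str, host: str) -> logging.Logger:
    """Return a logger that appends to <mount_dir>/<host>.log.

    mount_dir is assumed to be a directory on an NFS- or FUSE-mounted
    file system; to this code it is just an ordinary local path.
    """
    os.makedirs(mount_dir, exist_ok=True)
    logger = logging.getLogger(f"weblog.{host}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    # One file per host, so that several web servers never append
    # to the same file at the same time.
    handler = logging.FileHandler(os.path.join(mount_dir, f"{host}.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Usage would look like `make_mounted_logger("/mnt/cluster/weblogs", "web01").info("GET /index.html 200")`, where the path is again a placeholder. Writing one file per host is a design assumption made here because concurrent appends from several NFS clients to a single file are generally not safe; whether a given mount performs well enough is exactly the open question in this thread.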

I never knew there was a rule against discussing alternative software on
a mailing list. It seems like a closed-minded thing. I also doubt the
ASF would back a rule like that. Are we not allowed to talk about EMR
or S3, or am I not even allowed to mention S3?

Can Flume run on EC2 and log to S3? (Oops, party foul; I guess I can't ask that.)

Edward

On Sun, Apr 22, 2012 at 12:59 AM, Alexander Lorenz
<wg...@googlemail.com> wrote:
> No. That is the Flume open source mailing list, not a vendor list.
>
> NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe.
>
> sent via my mobile device

Re: Feedback on real world production experience with Flume

Posted by Alexander Lorenz <wg...@googlemail.com>.

No. That is the Flume open source mailing list, not a vendor list.

NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe.

sent via my mobile device

On Apr 22, 2012, at 12:23 AM, Edward Capriolo <ed...@gmail.com> wrote:

> It seems pretty relevant. If you can directly log via NFS that is a
> viable alternative.

Re: Feedback on real world production experience with Flume

Posted by Edward Capriolo <ed...@gmail.com>.

It seems pretty relevant. If you can directly log via NFS that is a
viable alternative.

On Sat, Apr 21, 2012 at 11:42 AM, alo alt <wg...@googlemail.com> wrote:
> We decided: NO product and vendor advertising on Apache mailing lists!
> I do not understand why you bring that closed-source stuff from your employer into the room. It has nothing to do with Flume or the use cases!
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com

Re: Feedback on real world production experience with Flume

Posted by alo alt <wg...@googlemail.com>.

We decided: NO product and vendor advertising on Apache mailing lists!
I do not understand why you bring that closed-source stuff from your employer into the room. It has nothing to do with Flume or the use cases!

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 21, 2012, at 4:06 PM, M. C. Srivas wrote:

> Karl,
> 
> Since you did ask for alternatives: people using MapR prefer to use
> NFS access to deposit data directly (or access it). It works seamlessly from
> all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems
> without having to load any agents on those machines. And it is fully
> automatic HA.
> 
> Since compression is built into MapR, the data gets compressed automatically
> coming in over NFS without much fuss.
> 
> With respect to performance, you can get about 870 MB/s per node if you have
> 10GigE attached (of course, with compression, the effective throughput will
> surpass that, depending on how well the data can be squeezed).

Re: Feedback on real world production experience with Flume

Posted by "M. C. Srivas" <mc...@gmail.com>.

Karl,

Since you did ask for alternatives: people using MapR prefer to use
NFS access to deposit data directly (or access it). It works seamlessly from
all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems
without having to load any agents on those machines. And it is fully
automatic HA.

Since compression is built into MapR, the data gets compressed automatically
coming in over NFS without much fuss.

With respect to performance, you can get about 870 MB/s per node if you have
10GigE attached (of course, with compression, the effective throughput will
surpass that, depending on how well the data can be squeezed).
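
To put the figures above in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 Gbit = 1000 Mbit, 8 bits per byte) and reads "effective throughput" as the uncompressed data rate when 870 MB/s of compressed data is handled; that reading, the function names, and the example compression ratio are assumptions, not measurements from this thread.

```python
def line_rate_mb_s(gigabits_per_s: float) -> float:
    """Theoretical link capacity in MB/s (decimal units, 8 bits per byte)."""
    return gigabits_per_s * 1000 / 8


def effective_throughput_mb_s(raw_mb_s: float, compression_ratio: float) -> float:
    """Uncompressed data rate if raw_mb_s of compressed data is moved.

    compression_ratio is uncompressed size divided by compressed size;
    it depends entirely on the data and is a hypothetical input here.
    """
    return raw_mb_s * compression_ratio


link = line_rate_mb_s(10)           # 10GigE is 1250 MB/s on paper
share = 870 / link                  # quoted rate as a fraction of line rate
doubled = effective_throughput_mb_s(870, 2.0)  # with an assumed 2:1 ratio
```

Under these assumptions the quoted 870 MB/s is roughly 70% of the theoretical 10GigE line rate, and a 2:1 compression ratio would correspond to 1740 MB/s of uncompressed data.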

On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig <kh...@baynote.com> wrote:

> I am investigating automated methods of moving our data from the web tier
> into HDFS for processing, a process that's performed periodically.
>
> I am looking for feedback from anyone who has actually used Flume in a
> production setup (redundant, failover) successfully.  I understand it is
> now being largely rearchitected during its incubation as Apache Flume-NG,
> so I don't have full confidence in the old, stable releases.
>
> The other option would be to write our own tools.  What methods are you
> using for these kinds of tasks?  Did you write your own or does Flume (or
> something else) work for you?
>
> I'm also on the Flume mailing list, but I wanted to ask these questions
> here because I'm interested in Flume _and_ alternatives.
>
> Thank you!
>
>

Re: Feedback on real world production experience with Flume

Posted by alo alt <wg...@googlemail.com>.

Hi,

In my former job:
production use, Germany, web portal. Throughput 600 MB/minute. Log files from Windows IIS and Apache. Used in the usual way, no custom decorators or sinks. Simply syslog -> bucketing (1-minute rollover) -> HDFS, split into minutes (YYYYMMDDHHMM).

Stable, with some issues (you'll find them on the mailing list), but it works well if you know what to do when something goes wrong. By the way, NG 1.1.0 is more stable than Flume pre-1.x and runs in some production environments.
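
The one-minute bucketing described above (events rolled into HDFS locations named YYYYMMDDHHMM) can be sketched in a few lines. This is only an illustration of the naming scheme, not Flume's own implementation; the function names and the base directory are hypothetical.

```python
from datetime import datetime


def minute_bucket(ts: datetime) -> str:
    """Return the YYYYMMDDHHMM bucket name for a timestamp."""
    return ts.strftime("%Y%m%d%H%M")


def bucket_path(base_dir: str, ts: datetime) -> str:
    """Build the target path for an event; base_dir is a hypothetical HDFS directory."""
    return f"{base_dir.rstrip('/')}/{minute_bucket(ts)}"


def group_by_minute(events):
    """Group (timestamp, line) pairs by minute bucket, keeping arrival order."""
    buckets = {}
    for ts, line in events:
        buckets.setdefault(minute_bucket(ts), []).append(line)
    return buckets
```

For example, an event stamped 2012-04-21 00:14:17 would land in bucket `201204210014`. At the quoted 600 MB per minute, each bucket would hold roughly 600 MB, which is about 10 MB/s sustained.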

- Alex

--
Alexander Lorenz
http://mapredit.blogspot.com

On Apr 21, 2012, at 12:14 AM, Karl Hennig wrote:

> I am investigating automated methods of moving our data from the web tier into HDFS for processing, a process that's performed periodically.
> 
> I am looking for feedback from anyone who has actually used Flume in a production setup (redundant, failover) successfully.  I understand it is now being largely rearchitected during its incubation as Apache Flume-NG, so I don't have full confidence in the old, stable releases.
> 
> The other option would be to write our own tools.  What methods are you using for these kinds of tasks?  Did you write your own or does Flume (or something else) work for you?
> 
> I'm also on the Flume mailing list, but I wanted to ask these questions here because I'm interested in Flume _and_ alternatives.
> 
> Thank you!
>