Posted to user@flume.apache.org by Jagadish Bihani <ja...@pubmatic.com> on 2012/09/27 05:23:28 UTC

HDFS sink Bucketwriter working

Hi

I have a few questions about the HDFS sink's BucketWriter:

-- How does the HDFS sink's BucketWriter work? What criteria does it use to
create another bucket?

-- File creation in HDFS is a function of how many parameters? Initially
I thought it was a function of only the rolling parameters (interval/size),
but apparently it is also a function of 'batchSize' and 'txnEventMax'.

-- My requirement is this: I receive data from 10 Avro sinks at a single
Avro source, and I want to write it to HDFS as fixed-size (say 64 MB) files.
What should I do? Presently, if I set the rolling size to 64 MB, BucketWriter
creates many files (I suspect the count equals txnEventMax), and after a
while it throws exceptions like 'too many open files'. (I have a limit of
75000 open file descriptors.)

Information about the above will be of great help in tuning Flume properly
for my requirements.

Regards,
Jagadish


Re: HDFS sink Bucketwriter working

Posted by Mike Percy <mp...@apache.org>.
Are you sure the files are remaining open? How do you know this? Are you
saying that the .tmp files remain?

Can you please post your flume.conf file and also provide the exact version
of Flume that you are running? If your version of Flume is modern enough,
please also post the output of the "flume-ng version" command.
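
If it helps to check, a couple of quick shell probes can show whether the
agent process is really holding descriptors open (a rough sketch using
standard tools; the pgrep pattern and the HDFS path are illustrative and
assume a single Flume JVM on the box):

    # count open file descriptors held by the Flume agent process
    # (this is what the 75000 ulimit applies to, sockets included)
    FLUME_PID=$(pgrep -f 'org.apache.flume.node.Application')
    ls /proc/"$FLUME_PID"/fd | wc -l

    # check whether in-progress .tmp files remain under the sink's HDFS path
    hadoop fs -ls /flume/events | grep '\.tmp'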

Regards,
Mike


On Wed, Sep 26, 2012 at 11:58 PM, Jagadish Bihani <
jagadish.bihani@pubmatic.com> wrote:

> [Jagadish's reply and the earlier quoted messages are snipped here; they
> appear in full as the messages below.]

Re: HDFS sink Bucketwriter working

Posted by Jagadish Bihani <ja...@pubmatic.com>.
Hi

Thanks for the reply Mike.

-- I have been following the user guide.

-- Actually, I didn't get the expected behaviour with rolling as per the
guide (i.e. when I set the rolling size to 10 MB and the other rolling
params to 0). I would expect all incoming events to go into a single file
until it reaches 10 MB in size, and then for the next events to go into the
next file, and so on. But Flume opens many files simultaneously, which I
thought was related to params like txnEventMax and batchSize.

-- Hence I started going through the source code and came across the
few questions mentioned in the mail below. I had posted the exceptions I
got in other threads, but I think that understanding the inner working of
the BucketWriter class will help me solve my troubles.

Regards,



On 09/27/2012 12:19 PM, Mike Percy wrote:
> [Mike's reply and the original message are snipped here; Mike's reply
> appears in full below, and the original message at the top of the thread.]


Re: HDFS sink Bucketwriter working

Posted by Mike Percy <mp...@apache.org>.
Jagadish,
Refer to the user guide here:
http://flume.apache.org/FlumeUserGuide.html#hdfs-sink

Note the defaults for rollInterval, rollSize, and rollCount. If you want to
use rollSize only, then you should set the others to 0.

It is also worth setting batchSize to something larger if you want to
maximize performance. I often go with 1000; depending on the application,
you may want to go lower or higher.
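
For illustration, a size-only rolling setup along those lines might look
like this (a sketch, not a drop-in config; the agent, sink, and channel
names and the HDFS path are placeholders):

    # roll on size only: one ~64 MB file at a time per bucket
    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = memChannel
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
    agent.sinks.hdfsSink.hdfs.rollSize = 67108864
    agent.sinks.hdfsSink.hdfs.rollInterval = 0
    agent.sinks.hdfsSink.hdfs.rollCount = 0
    # larger batches for throughput
    agent.sinks.hdfsSink.hdfs.batchSize = 1000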

Regards,
Mike


On Wed, Sep 26, 2012 at 8:23 PM, Jagadish Bihani <
jagadish.bihani@pubmatic.com> wrote:

> [original message snipped; see the top of the thread]