Posted to user@flume.apache.org by Chen Wang <ch...@gmail.com> on 2014/01/20 19:54:39 UTC

best way to make all hdfs records in one file under a folder?

Guys,
I have Flume set up to flow partitioned data into HDFS, and each partition has
its own folder. Is there a way to force all of the data under one partition
into a single file?
I am currently using
MyAgent.sinks.HDFS.hdfs.batchSize = 10000
MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
MyAgent.sinks.HDFS.hdfs.rollCount = 10000
MyAgent.sinks.HDFS.hdfs.rollInterval = 360

to make the file roll at ~15 MB of data or after 6 minutes.

Is this the best way to achieve my goal?
Thanks,
Chen

Re: best way to make all hdfs records in one file under a folder?

Posted by Jeff Lord <jl...@cloudera.com>.
If you don't intend to roll based on the number of events, then you will want
to set rollCount to 0:
MyAgent.sinks.HDFS.hdfs.rollCount = 0
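For reference, the full sink block with that change applied would look like
the sketch below. The agent/sink names and values are the ones Chen posted;
the idleTimeout line is an added assumption, not something from the thread.

# rollSize is in bytes, rollInterval in seconds; rollCount = 0 disables
# count-based rolling, so only size and time trigger a roll.
MyAgent.sinks.HDFS.hdfs.batchSize = 10000
MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
MyAgent.sinks.HDFS.hdfs.rollCount = 0
MyAgent.sinks.HDFS.hdfs.rollInterval = 360
# Assumption, not from the thread: close a file once its bucket goes idle,
# so a quiet partition doesn't hold an open .tmp file.
MyAgent.sinks.HDFS.hdfs.idleTimeout = 60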


On Mon, Jan 20, 2014 at 12:35 PM, Jimmy <ji...@gmail.com> wrote:

> Seems like the only reason is the "too many files" issue, correct?
>
> Running the File Crush tool regularly might be a better option than trying
> to tune this in Flume:
>
> http://www.jointhegrid.com/hadoop_filecrush/index.jsp
>
>
>
> ---------- Forwarded message ----------
> From: Chen Wang <ch...@gmail.com>
> Date: Mon, Jan 20, 2014 at 11:21 AM
> Subject: Re: best way to make all hdfs records in one file under a folder?
> To: user@flume.apache.org
>
>
> Chris,
> It's partitioned every 6 minutes (that's why I set the roll interval to
> 60*6 = 360 seconds). The data size is around 15 MB, so I want it all in
> one file.
> Chen
>
>
> On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon <
> cshannon108@gmail.com> wrote:
>
>> How is your data partitioned, by date?
>>
>>
>> On Monday, January 20, 2014, Chen Wang <ch...@gmail.com>
>> wrote:
>>
>>> Guys,
>>> I have Flume set up to flow partitioned data into HDFS, and each
>>> partition has its own folder. Is there a way to force all of the data
>>> under one partition into a single file?
>>> I am currently using
>>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>>
>>> to make the file roll at ~15 MB of data or after 6 minutes.
>>>
>>> Is this the best way to achieve my goal?
>>> Thanks,
>>> Chen
>>>
>>>
>
>
>

Fwd: best way to make all hdfs records in one file under a folder?

Posted by Jimmy <ji...@gmail.com>.
Seems like the only reason is the "too many files" issue, correct?

Running the File Crush tool regularly might be a better option than trying to
tune this in Flume:

http://www.jointhegrid.com/hadoop_filecrush/index.jsp



---------- Forwarded message ----------
From: Chen Wang <ch...@gmail.com>
Date: Mon, Jan 20, 2014 at 11:21 AM
Subject: Re: best way to make all hdfs records in one file under a folder?
To: user@flume.apache.org


Chris,
It's partitioned every 6 minutes (that's why I set the roll interval to
60*6 = 360 seconds). The data size is around 15 MB, so I want it all in one
file.
Chen


On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon
<cshannon108@gmail.com> wrote:

> How is your data partitioned, by date?
>
>
> On Monday, January 20, 2014, Chen Wang <ch...@gmail.com> wrote:
>
>> Guys,
>> I have Flume set up to flow partitioned data into HDFS, and each partition
>> has its own folder. Is there a way to force all of the data under one
>> partition into a single file?
>> I am currently using
>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>
>> to make the file roll at ~15 MB of data or after 6 minutes.
>>
>> Is this the best way to achieve my goal?
>> Thanks,
>> Chen
>>
>>

Re: best way to make all hdfs records in one file under a folder?

Posted by Chen Wang <ch...@gmail.com>.
Chris,
It's partitioned every 6 minutes (that's why I set the roll interval to
60*6 = 360 seconds). The data size is around 15 MB, so I want it all in one
file.
Chen
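
Since the partitions are time-based, one way to line the directory layout up
with those 6-minute windows is the HDFS sink's timestamp rounding. A sketch
follows; round/roundValue/roundUnit and the %Y/%m/%d/%H/%M escapes are
standard sink settings, but the path itself is hypothetical, not Chen's
actual config:

# Hypothetical path; %Y-%m-%d/%H%M are the sink's timestamp escapes
# (this assumes events carry a timestamp header).
MyAgent.sinks.HDFS.hdfs.path = hdfs:///flume/events/%Y-%m-%d/%H%M
# Round event timestamps down to 6-minute buckets, so every event in a
# window lands in the same directory.
MyAgent.sinks.HDFS.hdfs.round = true
MyAgent.sinks.HDFS.hdfs.roundValue = 6
MyAgent.sinks.HDFS.hdfs.roundUnit = minute

Combined with rollInterval = 360 and rollCount = 0, each 6-minute directory
should then end up holding a single file of roughly 15 MB.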


On Mon, Jan 20, 2014 at 10:57 AM, Christopher Shannon
<cshannon108@gmail.com> wrote:

> How is your data partitioned, by date?
>
>
> On Monday, January 20, 2014, Chen Wang <ch...@gmail.com> wrote:
>
>> Guys,
>> I have Flume set up to flow partitioned data into HDFS, and each partition
>> has its own folder. Is there a way to force all of the data under one
>> partition into a single file?
>> I am currently using
>> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
>> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
>> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
>> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>>
>> to make the file roll at ~15 MB of data or after 6 minutes.
>>
>> Is this the best way to achieve my goal?
>> Thanks,
>> Chen
>>
>>

Re: best way to make all hdfs records in one file under a folder?

Posted by Christopher Shannon <cs...@gmail.com>.
How is your data partitioned, by date?

On Monday, January 20, 2014, Chen Wang <ch...@gmail.com> wrote:

> Guys,
> I have Flume set up to flow partitioned data into HDFS, and each partition
> has its own folder. Is there a way to force all of the data under one
> partition into a single file?
> I am currently using
> MyAgent.sinks.HDFS.hdfs.batchSize = 10000
> MyAgent.sinks.HDFS.hdfs.rollSize = 15000000
> MyAgent.sinks.HDFS.hdfs.rollCount = 10000
> MyAgent.sinks.HDFS.hdfs.rollInterval = 360
>
> to make the file roll at ~15 MB of data or after 6 minutes.
>
> Is this the best way to achieve my goal?
> Thanks,
> Chen
>
>