Posted to user@flume.apache.org by Pankaj Gupta <pa...@brightroll.com> on 2012/11/03 00:08:36 UTC

HDFS Sink log rotation on the basis of time of writing

Hi,

Is it possible to organize files written to HDFS into buckets based on the
time of writing rather than the timestamp in the header? Alternatively, is
it possible to insert the timestamp interceptor just before the HDFS Sink?

My use case is to organize files so that they are ordered
chronologically as well as alphabetically by name, and so that only
one file is being written to at a time. This will make it easier to
look for newly available data so that MapReduce jobs can process it.

Thanks in Advance,
Pankaj
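
On the alternative raised here: Flume attaches interceptors to
sources, not sinks, so the usual way to get write-time bucketing is to
stamp each event with the built-in timestamp interceptor as it enters
the agent; the HDFS sink's path escapes then resolve against that
header. A minimal sketch (the agent, source, channel, and sink names
are hypothetical, and the netcat source stands in for whatever source
you actually use):

    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # A hypothetical source and in-memory channel, just to make the
    # sketch complete; any source is wired the same way.
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1
    agent1.channels.ch1.type = memory

    # Stamp each event with the current time as it enters the agent.
    agent1.sources.src1.interceptors = ts
    agent1.sources.src1.interceptors.ts.type = timestamp

    # The path escapes resolve against the stamped "timestamp" header.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d/%H

Later Flume releases also expose an hdfs.useLocalTimeStamp option on
the sink itself, which makes the sink bucket by its own clock instead
of the event header.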

Re: HDFS Sink log rotation on the basis of time of writing

Posted by Brock Noland <br...@cloudera.com>.
Hi,

Yes, you are correct. I suggest running an MR job once an hour to
merge those 60 files into one file.

Brock
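
A minimal sketch of such a merge job (the paths, class, and job names
here are hypothetical): a pass-through mapper drops the byte-offset
keys, and a single reducer funnels the hour's files into one output
file. Line order across the input files is not preserved, which is
usually fine for timestamped log records.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeHourlyFiles {

      // Emit each line unchanged; NullWritable keys keep the byte
      // offsets out of the merged output.
      public static class PassThroughMapper
          extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(NullWritable.get(), line);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge-hourly-files");
        job.setJarByClass(MergeHourlyFiles.class);
        job.setMapperClass(PassThroughMapper.class);
        // One reducer (the default identity reducer) => one output file.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // the hour's dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }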

On Mon, Nov 5, 2012 at 9:49 AM, Pankaj Gupta <pa...@brightroll.com> wrote:
> Hi Brock,
>
> But then if I rotate frequently, e.g. every minute, the total number of files in a single folder of HDFS will run into the thousands very quickly. I am not sure how (or if) that will affect HDFS NameNode performance, and I worry that it may suffer. I don't have a lot of experience with HDFS; do you happen to know whether having thousands of files in a single directory in HDFS is common?
>
> Thanks,
> Pankaj
>
>
> On Nov 5, 2012, at 7:30 AM, Brock Noland <br...@cloudera.com> wrote:
>
>> Hi,
>>
>> If you just did not bucket the data at all, the files would be
>> organized by the time the events arrived at the sink.
>>
>> Brock
>>
>> On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <pa...@brightroll.com> wrote:
>>> Hi,
>>>
>>> Is it possible to organize files written to HDFS into buckets based on the
>>> time of writing rather than the timestamp in the header? Alternatively, is
>>> it possible to insert the timestamp interceptor just before the HDFS Sink?
>>>
>>> My use case is to organize files so that they are ordered
>>> chronologically as well as alphabetically by name, and so that only
>>> one file is being written to at a time. This will make it easier to
>>> look for newly available data so that MapReduce jobs can process it.
>>>
>>> Thanks in Advance,
>>> Pankaj
>>>
>>>
>>>
>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: HDFS Sink log rotation on the basis of time of writing

Posted by Pankaj Gupta <pa...@brightroll.com>.
Hi Brock,

But then if I rotate frequently, e.g. every minute, the total number of files in a single folder of HDFS will run into the thousands very quickly. I am not sure how (or if) that will affect HDFS NameNode performance, and I worry that it may suffer. I don't have a lot of experience with HDFS; do you happen to know whether having thousands of files in a single directory in HDFS is common?

Thanks,
Pankaj
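
For a rough sense of scale (ballpark figures, not from this thread):
one file per minute is 1,440 files per day, or about 525,600 per
year. A commonly cited rule of thumb is that each file, directory,
and block costs on the order of 150 bytes of NameNode heap, so each
small file (one file entry plus one block) is roughly 300 bytes, and
a year of minute-level rotation works out to only ~150 MB of heap.
The bigger practical cost is usually on the MapReduce side, where
every small file becomes at least one map task.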


On Nov 5, 2012, at 7:30 AM, Brock Noland <br...@cloudera.com> wrote:

> Hi,
> 
> If you just did not bucket the data at all, the files would be
> organized by the time the events arrived at the sink.
> 
> Brock
> 
> On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <pa...@brightroll.com> wrote:
>> Hi,
>> 
>> Is it possible to organize files written to HDFS into buckets based on the
>> time of writing rather than the timestamp in the header? Alternatively, is
>> it possible to insert the timestamp interceptor just before the HDFS Sink?
>> 
>> My use case is to organize files so that they are ordered
>> chronologically as well as alphabetically by name, and so that only
>> one file is being written to at a time. This will make it easier to
>> look for newly available data so that MapReduce jobs can process it.
>> 
>> Thanks in Advance,
>> Pankaj
>> 
>> 
>> 
> 
> 
> 
> -- 
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/


Re: HDFS Sink log rotation on the basis of time of writing

Posted by Brock Noland <br...@cloudera.com>.
Hi,

If you just did not bucket the data at all, the files would be
organized by the time the events arrived at the sink.

Brock
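
Concretely: with a flat, unbucketed path and a time-only roll policy,
the sink writes exactly one file at a time and names each completed
file with its creation time in epoch milliseconds, so lexicographic
order matches chronological order. A minimal sketch of the sink side
(the agent and sink names are hypothetical):

    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
    agent1.sinks.sink1.hdfs.filePrefix = events
    # Roll on time alone: a new file every 60 seconds; disable the
    # size- and count-based roll triggers.
    agent1.sinks.sink1.hdfs.rollInterval = 60
    agent1.sinks.sink1.hdfs.rollSize = 0
    agent1.sinks.sink1.hdfs.rollCount = 0

This produces names like events.1352102400000, one completed file per
minute.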

On Fri, Nov 2, 2012 at 6:08 PM, Pankaj Gupta <pa...@brightroll.com> wrote:
> Hi,
>
> Is it possible to organize files written to HDFS into buckets based on the
> time of writing rather than the timestamp in the header? Alternatively, is
> it possible to insert the timestamp interceptor just before the HDFS Sink?
>
> My use case is to organize files so that they are ordered
> chronologically as well as alphabetically by name, and so that only
> one file is being written to at a time. This will make it easier to
> look for newly available data so that MapReduce jobs can process it.
>
> Thanks in Advance,
> Pankaj
>
>
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/