Posted to user@flume.apache.org by Mohit Anchlia <mo...@gmail.com> on 2012/11/11 17:32:28 UTC

.tmp in hdfs sink

What we are seeing is that if Flume gets killed, either because of a server
failure or for other reasons, it keeps the .tmp file around. Sometimes, for
whatever reason, the .tmp file is not readable. Is there a way to roll over
the .tmp file more gracefully?

Re: .tmp in hdfs sink

Posted by Alexander Alten-Lorenz <wg...@gmail.com>.
In the past I have done this with per-minute rolls (YY/MM/DD/HH/MM) and closed the sink after 30 seconds. In my cases this worked almost perfectly, but it depends on your use case.
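As a rough sketch (the agent and sink names here are placeholders, and the path pattern assumes the HDFS sink's standard date escape sequences):

# per-minute bucketing; roll (close) each file after 30 seconds
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%y/%m/%d/%H/%M
agent.sinks.hdfsSink.hdfs.rollInterval = 30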

Cheers,
 Alex

On Nov 16, 2012, at 5:16 AM, Mohit Anchlia <mo...@gmail.com> wrote:

> Another question I had was about rollover. What's the best way to rollover
> files in reasonable timeframe? For instance our path is YY/MM/DD/HH so
> every hour there is new file and the -1 hr is just sitting with .tmp and it
> takes sometimes even hour before .tmp is closed and renamed to .snappy. In
> this situation is there a way to tell flume to rollover files sooner based
> on some idle time limit?
> 
> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mo...@gmail.com>wrote:
> 
>> Thanks Mike it makes sense. Anyway I can help?
>> 
>> 
>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:
>> 
>>> Hi Mohit, this is a complicated issue. I've filed
>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>> 
>>> In short, it would require a non-trivial amount of work to implement
>>> this, and it would need to be done carefully. I agree that it would be
>>> better if Flume handled this case more gracefully than it does today.
>>> Today, Flume assumes that you have some job that would go and clean up the
>>> .tmp files as needed, and that you understand that they could be partially
>>> written if a crash occurred.
>>> 
>>> Regards,
>>> Mike
>>> 
>>> 
>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>>> 
>>>> What we are seeing is that if flume gets killed either because of server
>>>> failure or other reasons, it keeps around the .tmp file. Sometimes for
>>>> whatever reasons .tmp file is not readable. Is there a way to rollover .tmp
>>>> file more gracefully?
>>>> 
>>> 
>>> 
>> 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF


Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks for your responses so far. I checked out flume-1.3.0 and have built
it. My next question: is the property hdfs.closeIdleTimeout correct? Do I
need to set any other property? My current config looks like the following; I
write with a YYYY/MM/DD/HH path format, so essentially I get 1-2 files per hour.


webanalytics.sinks.hdfsSink.hdfs.filePrefix = web
webanalytics.sinks.hdfsSink.hdfs.rollInterval = 4000
webanalytics.sinks.hdfsSink.hdfs.rollCount = 20000000
#webanalytics.sinks.hdfsSink.hdfs.rollCount = 40000
webanalytics.sinks.hdfsSink.hdfs.rollSize = 15000000000
webanalytics.sinks.hdfsSink.hdfs.fileType = SequenceFile
webanalytics.sinks.hdfsSink.hdfs.writeFormat = Text
webanalytics.sinks.hdfsSink.hdfs.codeC = snappy
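For concreteness, this is the line I am planning to add, assuming the property from the FLUME-1660 patch is actually hdfs.idleTimeout with a value in seconds rather than hdfs.closeIdleTimeout (my assumption; please correct me if the name is wrong):

# assumed property name; close the file after 5 minutes with no writes
webanalytics.sinks.hdfsSink.hdfs.idleTimeout = 300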


On Wed, Nov 28, 2012 at 9:20 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

>  The changes are in both the 1.3 RC5 and in the 1.4 trunk
>
>
> On 11/29/2012 01:26 PM, Mohit Anchlia wrote:
>
> If I grab the last snapshot would I get these changes?
>
> On Tue, Nov 20, 2012 at 3:24 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>
>> that's awesome!
>>
>>
>> On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mp...@apache.org> wrote:
>>
>>> Mohit,
>>> No problem, but Juhani did all the work. :)
>>>
>>> The behavior is that you can configure an HDFS sink to close a file if
>>> it hasn't gotten any writes in some time. After it's been idle for 5
>>> minutes or something, it gets closed. If you get a "late" event that goes
>>> to the same path after the file is closed, it will just create a new file
>>> in the same path as usual.
>>>
>>> Regards,
>>> Mike
>>>
>>>
>>> On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <br...@cloudera.com>wrote:
>>>
>>>> We are currently voting on a 1.3.0 RC on the dev@ list:
>>>>
>>>> http://s.apache.org/OQ0W
>>>>
>>>> You don't have to be a committer to vote! :)
>>>>
>>>> Brock
>>>>
>>>> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mo...@gmail.com>
>>>> wrote:
>>>> > Thanks a lot!! Now with this what should be the expected behaviour?
>>>> After
>>>> > file is closed a new file is created for writes that come after
>>>> closing the
>>>> > file?
>>>> >
>>>> > Thanks again for committing this change. Do you know when 1.3.0 is
>>>> out? I am
>>>> > currently using the snapshot version of 1.3.0
>>>> >
>>>> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org>
>>>> wrote:
>>>> >>
>>>> >> Mohit,
>>>> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case
>>>> where you
>>>> >> are using 1.2.0, I suggest running with hdfs.rollInterval set so the
>>>> files
>>>> >> will roll normally.
>>>> >>
>>>> >> Regards,
>>>> >> Mike
>>>> >>
>>>> >>
>>>> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>>>> >> <ju...@cyberagent.co.jp> wrote:
>>>> >>>
>>>> >>> I am actually working on a patch for exactly this, refer to
>>>> FLUME-1660
>>>> >>>
>>>> >>> The patch is on review board right now, I fixed a corner case issue
>>>> that
>>>> >>> came up with unit testing, but the implementation is not really to
>>>> my
>>>> >>> satisfaction. If you are interested please have a look and add your
>>>> opinion.
>>>> >>>
>>>> >>> https://issues.apache.org/jira/browse/FLUME-1660
>>>> >>> https://reviews.apache.org/r/7659/
>>>> >>>
>>>> >>>
>>>> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>>> >>>
>>>> >>> Another question I had was about rollover. What's the best way to
>>>> >>> rollover files in reasonable timeframe? For instance our path is
>>>> YY/MM/DD/HH
>>>> >>> so every hour there is new file and the -1 hr is just sitting with
>>>> .tmp and
>>>> >>> it takes sometimes even hour before .tmp is closed and renamed to
>>>> .snappy.
>>>> >>> In this situation is there a way to tell flume to rollover files
>>>> sooner
>>>> >>> based on some idle time limit?
>>>> >>>
>>>> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <
>>>> mohitanchlia@gmail.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> Thanks Mike it makes sense. Anyway I can help?
>>>> >>>>
>>>> >>>>
>>>> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org>
>>>> wrote:
>>>> >>>>>
>>>> >>>>> Hi Mohit, this is a complicated issue. I've filed
>>>> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>> >>>>>
>>>> >>>>> In short, it would require a non-trivial amount of work to
>>>> implement
>>>> >>>>> this, and it would need to be done carefully. I agree that it
>>>> would be
>>>> >>>>> better if Flume handled this case more gracefully than it does
>>>> today. Today,
>>>> >>>>> Flume assumes that you have some job that would go and clean up
>>>> the .tmp
>>>> >>>>> files as needed, and that you understand that they could be
>>>> partially
>>>> >>>>> written if a crash occurred.
>>>> >>>>>
>>>> >>>>> Regards,
>>>> >>>>> Mike
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <
>>>> mohitanchlia@gmail.com>
>>>> >>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> What we are seeing is that if flume gets killed either because of
>>>> >>>>>> server failure or other reasons, it keeps around the .tmp file.
>>>> Sometimes
>>>> >>>>>> for whatever reasons .tmp file is not readable. Is there a way
>>>> to rollover
>>>> >>>>>> .tmp file more gracefully?
>>>> >>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce -
>>>> http://incubator.apache.org/mrunit/
>>>>
>>>
>>>
>>
>
>

Re: .tmp in hdfs sink

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
The changes are in both the 1.3 RC5 and the 1.4 trunk

On 11/29/2012 01:26 PM, Mohit Anchlia wrote:
> If I grab the last snapshot would I get these changes?
>
> On Tue, Nov 20, 2012 at 3:24 PM, Mohit Anchlia <mohitanchlia@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     that's awesome!
>
>
>     On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mpercy@apache.org
>     <ma...@apache.org>> wrote:
>
>         Mohit,
>         No problem, but Juhani did all the work. :)
>
>         The behavior is that you can configure an HDFS sink to close a
>         file if it hasn't gotten any writes in some time. After it's
>         been idle for 5 minutes or something, it gets closed. If you
>         get a "late" event that goes to the same path after the file
>         is closed, it will just create a new file in the same path as
>         usual.
>
>         Regards,
>         Mike
>
>
>         On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland
>         <brock@cloudera.com <ma...@cloudera.com>> wrote:
>
>             We are currently voting on a 1.3.0 RC on the dev@ list:
>
>             http://s.apache.org/OQ0W
>
>             You don't have to be a committer to vote! :)
>
>             Brock
>
>             On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia
>             <mohitanchlia@gmail.com <ma...@gmail.com>>
>             wrote:
>             > Thanks a lot!! Now with this what should be the expected
>             behaviour? After
>             > file is closed a new file is created for writes that
>             come after closing the
>             > file?
>             >
>             > Thanks again for committing this change. Do you know
>             when 1.3.0 is out? I am
>             > currently using the snapshot version of 1.3.0
>             >
>             > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy
>             <mpercy@apache.org <ma...@apache.org>> wrote:
>             >>
>             >> Mohit,
>             >> FLUME-1660 is now committed and it will be in 1.3.0. In
>             the case where you
>             >> are using 1.2.0, I suggest running with
>             hdfs.rollInterval set so the files
>             >> will roll normally.
>             >>
>             >> Regards,
>             >> Mike
>             >>
>             >>
>             >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>             >> <juhani_connolly@cyberagent.co.jp
>             <ma...@cyberagent.co.jp>> wrote:
>             >>>
>             >>> I am actually working on a patch for exactly this,
>             refer to FLUME-1660
>             >>>
>             >>> The patch is on review board right now, I fixed a
>             corner case issue that
>             >>> came up with unit testing, but the implementation is
>             not really to my
>             >>> satisfaction. If you are interested please have a look
>             and add your opinion.
>             >>>
>             >>> https://issues.apache.org/jira/browse/FLUME-1660
>             >>> https://reviews.apache.org/r/7659/
>             >>>
>             >>>
>             >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>             >>>
>             >>> Another question I had was about rollover. What's the
>             best way to
>             >>> rollover files in reasonable timeframe? For instance
>             our path is YY/MM/DD/HH
>             >>> so every hour there is new file and the -1 hr is just
>             sitting with .tmp and
>             >>> it takes sometimes even hour before .tmp is closed and
>             renamed to .snappy.
>             >>> In this situation is there a way to tell flume to
>             rollover files sooner
>             >>> based on some idle time limit?
>             >>>
>             >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia
>             <mohitanchlia@gmail.com <ma...@gmail.com>>
>             >>> wrote:
>             >>>>
>             >>>> Thanks Mike it makes sense. Anyway I can help?
>             >>>>
>             >>>>
>             >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy
>             <mpercy@apache.org <ma...@apache.org>> wrote:
>             >>>>>
>             >>>>> Hi Mohit, this is a complicated issue. I've filed
>             >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to
>             track it.
>             >>>>>
>             >>>>> In short, it would require a non-trivial amount of
>             work to implement
>             >>>>> this, and it would need to be done carefully. I
>             agree that it would be
>             >>>>> better if Flume handled this case more gracefully
>             than it does today. Today,
>             >>>>> Flume assumes that you have some job that would go
>             and clean up the .tmp
>             >>>>> files as needed, and that you understand that they
>             could be partially
>             >>>>> written if a crash occurred.
>             >>>>>
>             >>>>> Regards,
>             >>>>> Mike
>             >>>>>
>             >>>>>
>             >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia
>             <mohitanchlia@gmail.com <ma...@gmail.com>>
>             >>>>> wrote:
>             >>>>>>
>             >>>>>> What we are seeing is that if flume gets killed
>             either because of
>             >>>>>> server failure or other reasons, it keeps around
>             the .tmp file. Sometimes
>             >>>>>> for whatever reasons .tmp file is not readable. Is
>             there a way to rollover
>             >>>>>> .tmp file more gracefully?
>             >>>>>
>             >>>>>
>             >>>>
>             >>>
>             >>>
>             >>
>             >
>
>
>
>             --
>             Apache MRUnit - Unit testing MapReduce -
>             http://incubator.apache.org/mrunit/
>
>
>
>


Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
If I grab the latest snapshot, would I get these changes?

On Tue, Nov 20, 2012 at 3:24 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> that's awesome!
>
>
> On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mp...@apache.org> wrote:
>
>> Mohit,
>> No problem, but Juhani did all the work. :)
>>
>> The behavior is that you can configure an HDFS sink to close a file if it
>> hasn't gotten any writes in some time. After it's been idle for 5 minutes
>> or something, it gets closed. If you get a "late" event that goes to the
>> same path after the file is closed, it will just create a new file in the
>> same path as usual.
>>
>> Regards,
>> Mike
>>
>>
>> On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <br...@cloudera.com>wrote:
>>
>>> We are currently voting on a 1.3.0 RC on the dev@ list:
>>>
>>> http://s.apache.org/OQ0W
>>>
>>> You don't have to be a committer to vote! :)
>>>
>>> Brock
>>>
>>> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mo...@gmail.com>
>>> wrote:
>>> > Thanks a lot!! Now with this what should be the expected behaviour?
>>> After
>>> > file is closed a new file is created for writes that come after
>>> closing the
>>> > file?
>>> >
>>> > Thanks again for committing this change. Do you know when 1.3.0 is
>>> out? I am
>>> > currently using the snapshot version of 1.3.0
>>> >
>>> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org>
>>> wrote:
>>> >>
>>> >> Mohit,
>>> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case
>>> where you
>>> >> are using 1.2.0, I suggest running with hdfs.rollInterval set so the
>>> files
>>> >> will roll normally.
>>> >>
>>> >> Regards,
>>> >> Mike
>>> >>
>>> >>
>>> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>>> >> <ju...@cyberagent.co.jp> wrote:
>>> >>>
>>> >>> I am actually working on a patch for exactly this, refer to
>>> FLUME-1660
>>> >>>
>>> >>> The patch is on review board right now, I fixed a corner case issue
>>> that
>>> >>> came up with unit testing, but the implementation is not really to my
>>> >>> satisfaction. If you are interested please have a look and add your
>>> opinion.
>>> >>>
>>> >>> https://issues.apache.org/jira/browse/FLUME-1660
>>> >>> https://reviews.apache.org/r/7659/
>>> >>>
>>> >>>
>>> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>> >>>
>>> >>> Another question I had was about rollover. What's the best way to
>>> >>> rollover files in reasonable timeframe? For instance our path is
>>> YY/MM/DD/HH
>>> >>> so every hour there is new file and the -1 hr is just sitting with
>>> .tmp and
>>> >>> it takes sometimes even hour before .tmp is closed and renamed to
>>> .snappy.
>>> >>> In this situation is there a way to tell flume to rollover files
>>> sooner
>>> >>> based on some idle time limit?
>>> >>>
>>> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <
>>> mohitanchlia@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Thanks Mike it makes sense. Anyway I can help?
>>> >>>>
>>> >>>>
>>> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org>
>>> wrote:
>>> >>>>>
>>> >>>>> Hi Mohit, this is a complicated issue. I've filed
>>> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>> >>>>>
>>> >>>>> In short, it would require a non-trivial amount of work to
>>> implement
>>> >>>>> this, and it would need to be done carefully. I agree that it
>>> would be
>>> >>>>> better if Flume handled this case more gracefully than it does
>>> today. Today,
>>> >>>>> Flume assumes that you have some job that would go and clean up
>>> the .tmp
>>> >>>>> files as needed, and that you understand that they could be
>>> partially
>>> >>>>> written if a crash occurred.
>>> >>>>>
>>> >>>>> Regards,
>>> >>>>> Mike
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <
>>> mohitanchlia@gmail.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> What we are seeing is that if flume gets killed either because of
>>> >>>>>> server failure or other reasons, it keeps around the .tmp file.
>>> Sometimes
>>> >>>>>> for whatever reasons .tmp file is not readable. Is there a way to
>>> rollover
>>> >>>>>> .tmp file more gracefully?
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce -
>>> http://incubator.apache.org/mrunit/
>>>
>>
>>
>

Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
that's awesome!

On Tue, Nov 20, 2012 at 3:11 PM, Mike Percy <mp...@apache.org> wrote:

> Mohit,
> No problem, but Juhani did all the work. :)
>
> The behavior is that you can configure an HDFS sink to close a file if it
> hasn't gotten any writes in some time. After it's been idle for 5 minutes
> or something, it gets closed. If you get a "late" event that goes to the
> same path after the file is closed, it will just create a new file in the
> same path as usual.
>
> Regards,
> Mike
>
>
> On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <br...@cloudera.com> wrote:
>
>> We are currently voting on a 1.3.0 RC on the dev@ list:
>>
>> http://s.apache.org/OQ0W
>>
>> You don't have to be a committer to vote! :)
>>
>> Brock
>>
>> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>> > Thanks a lot!! Now with this what should be the expected behaviour?
>> After
>> > file is closed a new file is created for writes that come after closing
>> the
>> > file?
>> >
>> > Thanks again for committing this change. Do you know when 1.3.0 is out?
>> I am
>> > currently using the snapshot version of 1.3.0
>> >
>> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org> wrote:
>> >>
>> >> Mohit,
>> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case where
>> you
>> >> are using 1.2.0, I suggest running with hdfs.rollInterval set so the
>> files
>> >> will roll normally.
>> >>
>> >> Regards,
>> >> Mike
>> >>
>> >>
>> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>> >> <ju...@cyberagent.co.jp> wrote:
>> >>>
>> >>> I am actually working on a patch for exactly this, refer to FLUME-1660
>> >>>
>> >>> The patch is on review board right now, I fixed a corner case issue
>> that
>> >>> came up with unit testing, but the implementation is not really to my
>> >>> satisfaction. If you are interested please have a look and add your
>> opinion.
>> >>>
>> >>> https://issues.apache.org/jira/browse/FLUME-1660
>> >>> https://reviews.apache.org/r/7659/
>> >>>
>> >>>
>> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>> >>>
>> >>> Another question I had was about rollover. What's the best way to
>> >>> rollover files in reasonable timeframe? For instance our path is
>> YY/MM/DD/HH
>> >>> so every hour there is new file and the -1 hr is just sitting with
>> .tmp and
>> >>> it takes sometimes even hour before .tmp is closed and renamed to
>> .snappy.
>> >>> In this situation is there a way to tell flume to rollover files
>> sooner
>> >>> based on some idle time limit?
>> >>>
>> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <
>> mohitanchlia@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Thanks Mike it makes sense. Anyway I can help?
>> >>>>
>> >>>>
>> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org>
>> wrote:
>> >>>>>
>> >>>>> Hi Mohit, this is a complicated issue. I've filed
>> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>> >>>>>
>> >>>>> In short, it would require a non-trivial amount of work to implement
>> >>>>> this, and it would need to be done carefully. I agree that it would
>> be
>> >>>>> better if Flume handled this case more gracefully than it does
>> today. Today,
>> >>>>> Flume assumes that you have some job that would go and clean up the
>> .tmp
>> >>>>> files as needed, and that you understand that they could be
>> partially
>> >>>>> written if a crash occurred.
>> >>>>>
>> >>>>> Regards,
>> >>>>> Mike
>> >>>>>
>> >>>>>
>> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <
>> mohitanchlia@gmail.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> What we are seeing is that if flume gets killed either because of
>> >>>>>> server failure or other reasons, it keeps around the .tmp file.
>> Sometimes
>> >>>>>> for whatever reasons .tmp file is not readable. Is there a way to
>> rollover
>> >>>>>> .tmp file more gracefully?
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>
>> >
>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce -
>> http://incubator.apache.org/mrunit/
>>
>
>

Re: .tmp in hdfs sink

Posted by Mike Percy <mp...@apache.org>.
Mohit,
No problem, but Juhani did all the work. :)

The behavior is that you can configure an HDFS sink to close a file if it
hasn't gotten any writes in some time. After it's been idle for 5 minutes
or something, it gets closed. If you get a "late" event that goes to the
same path after the file is closed, it will just create a new file in the
same path as usual.
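As a sketch, assuming the new setting is exposed as hdfs.idleTimeout with a value in seconds (the name here is my assumption; double-check the user guide for your build):

# assumed property name; close a file once it has seen no writes for 5 minutes
agent.sinks.hdfsSink.hdfs.idleTimeout = 300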

Regards,
Mike

On Tue, Nov 20, 2012 at 12:56 PM, Brock Noland <br...@cloudera.com> wrote:

> We are currently voting on a 1.3.0 RC on the dev@ list:
>
> http://s.apache.org/OQ0W
>
> You don't have to be a committer to vote! :)
>
> Brock
>
> On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > Thanks a lot!! Now with this what should be the expected behaviour? After
> > file is closed a new file is created for writes that come after closing
> the
> > file?
> >
> > Thanks again for committing this change. Do you know when 1.3.0 is out?
> I am
> > currently using the snapshot version of 1.3.0
> >
> > On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org> wrote:
> >>
> >> Mohit,
> >> FLUME-1660 is now committed and it will be in 1.3.0. In the case where
> you
> >> are using 1.2.0, I suggest running with hdfs.rollInterval set so the
> files
> >> will roll normally.
> >>
> >> Regards,
> >> Mike
> >>
> >>
> >> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
> >> <ju...@cyberagent.co.jp> wrote:
> >>>
> >>> I am actually working on a patch for exactly this, refer to FLUME-1660
> >>>
> >>> The patch is on review board right now, I fixed a corner case issue
> that
> >>> came up with unit testing, but the implementation is not really to my
> >>> satisfaction. If you are interested please have a look and add your
> opinion.
> >>>
> >>> https://issues.apache.org/jira/browse/FLUME-1660
> >>> https://reviews.apache.org/r/7659/
> >>>
> >>>
> >>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
> >>>
> >>> Another question I had was about rollover. What's the best way to
> >>> rollover files in reasonable timeframe? For instance our path is
> YY/MM/DD/HH
> >>> so every hour there is new file and the -1 hr is just sitting with
> .tmp and
> >>> it takes sometimes even hour before .tmp is closed and renamed to
> .snappy.
> >>> In this situation is there a way to tell flume to rollover files sooner
> >>> based on some idle time limit?
> >>>
> >>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mohitanchlia@gmail.com
> >
> >>> wrote:
> >>>>
> >>>> Thanks Mike it makes sense. Anyway I can help?
> >>>>
> >>>>
> >>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org>
> wrote:
> >>>>>
> >>>>> Hi Mohit, this is a complicated issue. I've filed
> >>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
> >>>>>
> >>>>> In short, it would require a non-trivial amount of work to implement
> >>>>> this, and it would need to be done carefully. I agree that it would
> be
> >>>>> better if Flume handled this case more gracefully than it does
> today. Today,
> >>>>> Flume assumes that you have some job that would go and clean up the
> .tmp
> >>>>> files as needed, and that you understand that they could be partially
> >>>>> written if a crash occurred.
> >>>>>
> >>>>> Regards,
> >>>>> Mike
> >>>>>
> >>>>>
> >>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <
> mohitanchlia@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> What we are seeing is that if flume gets killed either because of
> >>>>>> server failure or other reasons, it keeps around the .tmp file.
> Sometimes
> >>>>>> for whatever reasons .tmp file is not readable. Is there a way to
> rollover
> >>>>>> .tmp file more gracefully?
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce -
> http://incubator.apache.org/mrunit/
>

Re: .tmp in hdfs sink

Posted by Brock Noland <br...@cloudera.com>.
We are currently voting on a 1.3.0 RC on the dev@ list:

http://s.apache.org/OQ0W

You don't have to be a committer to vote! :)

Brock

On Tue, Nov 20, 2012 at 2:53 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> Thanks a lot!! Now with this what should be the expected behaviour? After
> file is closed a new file is created for writes that come after closing the
> file?
>
> Thanks again for committing this change. Do you know when 1.3.0 is out? I am
> currently using the snapshot version of 1.3.0
>
> On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org> wrote:
>>
>> Mohit,
>> FLUME-1660 is now committed and it will be in 1.3.0. In the case where you
>> are using 1.2.0, I suggest running with hdfs.rollInterval set so the files
>> will roll normally.
>>
>> Regards,
>> Mike
>>
>>
>> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly
>> <ju...@cyberagent.co.jp> wrote:
>>>
>>> I am actually working on a patch for exactly this, refer to FLUME-1660
>>>
>>> The patch is on review board right now, I fixed a corner case issue that
>>> came up with unit testing, but the implementation is not really to my
>>> satisfaction. If you are interested please have a look and add your opinion.
>>>
>>> https://issues.apache.org/jira/browse/FLUME-1660
>>> https://reviews.apache.org/r/7659/
>>>
>>>
>>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>>
>>> Another question I had was about rollover. What's the best way to
>>> rollover files in reasonable timeframe? For instance our path is YY/MM/DD/HH
>>> so every hour there is new file and the -1 hr is just sitting with .tmp and
>>> it takes sometimes even hour before .tmp is closed and renamed to .snappy.
>>> In this situation is there a way to tell flume to rollover files sooner
>>> based on some idle time limit?
>>>
>>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mo...@gmail.com>
>>> wrote:
>>>>
>>>> Thanks Mike it makes sense. Anyway I can help?
>>>>
>>>>
>>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:
>>>>>
>>>>> Hi Mohit, this is a complicated issue. I've filed
>>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>>>
>>>>> In short, it would require a non-trivial amount of work to implement
>>>>> this, and it would need to be done carefully. I agree that it would be
>>>>> better if Flume handled this case more gracefully than it does today. Today,
>>>>> Flume assumes that you have some job that would go and clean up the .tmp
>>>>> files as needed, and that you understand that they could be partially
>>>>> written if a crash occurred.
>>>>>
>>>>> Regards,
>>>>> Mike
>>>>>
>>>>>
>>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> What we are seeing is that if flume gets killed either because of
>>>>>> server failure or other reasons, it keeps around the .tmp file. Sometimes
>>>>>> for whatever reasons .tmp file is not readable. Is there a way to rollover
>>>>>> .tmp file more gracefully?
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks a lot!! Now, with this, what is the expected behaviour? After a file
is closed, is a new file created for writes that arrive after the close?

Thanks again for committing this change. Do you know when 1.3.0 is out? I
am currently using the snapshot version of 1.3.0

On Tue, Nov 20, 2012 at 11:16 AM, Mike Percy <mp...@apache.org> wrote:

> Mohit,
> FLUME-1660 is now committed and it will be in 1.3.0. In the case where you
> are using 1.2.0, I suggest running with hdfs.rollInterval set so the files
> will roll normally.
>
> Regards,
> Mike
>
>
> On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>>  I am actually working on a patch for exactly this, refer to FLUME-1660
>>
>> The patch is on review board right now, I fixed a corner case issue that
>> came up with unit testing, but the implementation is not really to my
>> satisfaction. If you are interested please have a look and add your opinion.
>>
>> https://issues.apache.org/jira/browse/FLUME-1660
>> https://reviews.apache.org/r/7659/
>>
>>
>> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>>
>> Another question I had was about rollover. What's the best way to
>> rollover files in reasonable timeframe? For instance our path is
>> YY/MM/DD/HH so every hour there is new file and the -1 hr is just sitting
>> with .tmp and it takes sometimes even hour before .tmp is closed and
>> renamed to .snappy. In this situation is there a way to tell flume to
>> rollover files sooner based on some idle time limit?
>>
>> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>>
>>> Thanks Mike it makes sense. Anyway I can help?
>>>
>>>
>>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:
>>>
>>>> Hi Mohit, this is a complicated issue. I've filed
>>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>>
>>>> In short, it would require a non-trivial amount of work to implement
>>>> this, and it would need to be done carefully. I agree that it would be
>>>> better if Flume handled this case more gracefully than it does today.
>>>> Today, Flume assumes that you have some job that would go and clean up the
>>>> .tmp files as needed, and that you understand that they could be partially
>>>> written if a crash occurred.
>>>>
>>>> Regards,
>>>> Mike
>>>>
>>>>
>>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>>>>
>>>>> What we are seeing is that if flume gets killed either because of
>>>>> server failure or other reasons, it keeps around the .tmp file. Sometimes
>>>>> for whatever reasons .tmp file is not readable. Is there a way to rollover
>>>>> .tmp file more gracefully?
>>>>>
>>>>
>>>>
>>>
>>
>>
>

Re: .tmp in hdfs sink

Posted by Mike Percy <mp...@apache.org>.
Mohit,
FLUME-1660 is now committed and it will be in 1.3.0. In the case where you
are using 1.2.0, I suggest running with hdfs.rollInterval set so the files
will roll normally.
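For example, something like the following sketch (setting rollSize and rollCount to 0 should disable size- and count-based rolling, so only the timer applies):

# roll (close and rename) the file every 10 minutes
agent.sinks.hdfsSink.hdfs.rollInterval = 600
agent.sinks.hdfsSink.hdfs.rollSize = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0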

Regards,
Mike

On Thu, Nov 15, 2012 at 11:23 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

>  I am actually working on a patch for exactly this, refer to FLUME-1660
>
> The patch is on review board right now, I fixed a corner case issue that
> came up with unit testing, but the implementation is not really to my
> satisfaction. If you are interested please have a look and add your opinion.
>
> https://issues.apache.org/jira/browse/FLUME-1660
> https://reviews.apache.org/r/7659/
>
>
> On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
>
> Another question I had was about rollover. What's the best way to rollover
> files in reasonable timeframe? For instance our path is YY/MM/DD/HH so
> every hour there is new file and the -1 hr is just sitting with .tmp and it
> takes sometimes even hour before .tmp is closed and renamed to .snappy. In
> this situation is there a way to tell flume to rollover files sooner based
> on some idle time limit?
>
> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mo...@gmail.com>wrote:
>
>> Thanks Mike it makes sense. Anyway I can help?
>>
>>
>> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:
>>
>>> Hi Mohit, this is a complicated issue. I've filed
>>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>>
>>>  In short, it would require a non-trivial amount of work to implement
>>> this, and it would need to be done carefully. I agree that it would be
>>> better if Flume handled this case more gracefully than it does today.
>>> Today, Flume assumes that you have some job that would go and clean up the
>>> .tmp files as needed, and that you understand that they could be partially
>>> written if a crash occurred.
>>>
>>>  Regards,
>>> Mike
>>>
>>>
>>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>>>
>>>> What we are seeing is that if flume gets killed either because of
>>>> server failure or other reasons, it keeps around the .tmp file. Sometimes
>>>> for whatever reasons .tmp file is not readable. Is there a way to rollover
>>>> .tmp file more gracefully?
>>>>
>>>
>>>
>>
>
>

Re: .tmp in hdfs sink

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
I am actually working on a patch for exactly this; refer to FLUME-1660.

The patch is on Review Board right now. I fixed a corner-case issue that
came up in unit testing, but the implementation is not really to my
satisfaction. If you are interested, please have a look and add your opinion.

https://issues.apache.org/jira/browse/FLUME-1660
https://reviews.apache.org/r/7659/

On 11/16/2012 01:16 PM, Mohit Anchlia wrote:
> Another question I had was about rollover. What's the best way to 
> rollover files in reasonable timeframe? For instance our path is 
> YY/MM/DD/HH so every hour there is new file and the -1 hr is just 
> sitting with .tmp and it takes sometimes even hour before .tmp is 
> closed and renamed to .snappy. In this situation is there a way to 
> tell flume to rollover files sooner based on some idle time limit?
>
> On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mohitanchlia@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Thanks Mike it makes sense. Anyway I can help?
>
>
>     On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mpercy@apache.org
>     <ma...@apache.org>> wrote:
>
>         Hi Mohit, this is a complicated issue. I've filed
>         https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>
>         In short, it would require a non-trivial amount of work to
>         implement this, and it would need to be done carefully. I
>         agree that it would be better if Flume handled this case more
>         gracefully than it does today. Today, Flume assumes that you
>         have some job that would go and clean up the .tmp files as
>         needed, and that you understand that they could be partially
>         written if a crash occurred.
>
>         Regards,
>         Mike
>
>
>         On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia
>         <mohitanchlia@gmail.com <ma...@gmail.com>> wrote:
>
>             What we are seeing is that if flume gets killed either
>             because of server failure or other reasons, it keeps
>             around the .tmp file. Sometimes for whatever reasons .tmp
>             file is not readable. Is there a way to rollover .tmp file
>             more gracefully?
>
>
>
>


Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
Another question I had was about rollover. What's the best way to roll over
files in a reasonable timeframe? For instance, our path is YY/MM/DD/HH, so
every hour there is a new file, and the previous hour's file just sits there
as .tmp; it sometimes takes as long as an hour before the .tmp is closed and
renamed to .snappy. In this situation, is there a way to tell Flume to roll
over files sooner based on some idle time limit?

On Thu, Nov 15, 2012 at 8:14 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> Thanks Mike it makes sense. Anyway I can help?
>
>
> On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:
>
>> Hi Mohit, this is a complicated issue. I've filed
>> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>>
>> In short, it would require a non-trivial amount of work to implement
>> this, and it would need to be done carefully. I agree that it would be
>> better if Flume handled this case more gracefully than it does today.
>> Today, Flume assumes that you have some job that would go and clean up the
>> .tmp files as needed, and that you understand that they could be partially
>> written if a crash occurred.
>>
>> Regards,
>> Mike
>>
>>
>> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>>
>>> What we are seeing is that if flume gets killed either because of server
>>> failure or other reasons, it keeps around the .tmp file. Sometimes for
>>> whatever reasons .tmp file is not readable. Is there a way to rollover .tmp
>>> file more gracefully?
>>>
>>
>>
>

Re: .tmp in hdfs sink

Posted by Mohit Anchlia <mo...@gmail.com>.
Thanks Mike, that makes sense. Is there any way I can help?

On Thu, Nov 15, 2012 at 11:54 AM, Mike Percy <mp...@apache.org> wrote:

> Hi Mohit, this is a complicated issue. I've filed
> https://issues.apache.org/jira/browse/FLUME-1714 to track it.
>
> In short, it would require a non-trivial amount of work to implement this,
> and it would need to be done carefully. I agree that it would be better if
> Flume handled this case more gracefully than it does today. Today, Flume
> assumes that you have some job that would go and clean up the .tmp files as
> needed, and that you understand that they could be partially written if a
> crash occurred.
>
> Regards,
> Mike
>
>
> On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>
>> What we are seeing is that if flume gets killed either because of server
>> failure or other reasons, it keeps around the .tmp file. Sometimes for
>> whatever reasons .tmp file is not readable. Is there a way to rollover .tmp
>> file more gracefully?
>>
>
>

Re: .tmp in hdfs sink

Posted by Mike Percy <mp...@apache.org>.
Hi Mohit, this is a complicated issue. I've filed
https://issues.apache.org/jira/browse/FLUME-1714 to track it.

In short, it would require a non-trivial amount of work to implement this,
and it would need to be done carefully. I agree that it would be better if
Flume handled this case more gracefully than it does today. Today, Flume
assumes that you have some job that would go and clean up the .tmp files as
needed, and that you understand that they could be partially written if a
crash occurred.

Regards,
Mike

On Sun, Nov 11, 2012 at 8:32 AM, Mohit Anchlia <mo...@gmail.com> wrote:

> What we are seeing is that if flume gets killed either because of server
> failure or other reasons, it keeps around the .tmp file. Sometimes for
> whatever reasons .tmp file is not readable. Is there a way to rollover .tmp
> file more gracefully?
>