Posted to user@flume.apache.org by "Bhaskar V. Karambelkar" <bh...@gmail.com> on 2013/01/17 21:07:46 UTC

hdfs.idleTimeout, what's it used for?

Say I have

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/

hdfs.rollInterval=60

Now, suppose there is a file
/flume/events/2013-01-17/flume_XXXXXXXXX.tmp
This file is not ready to be rolled over yet, i.e. the 60 seconds are not
up, and now it's past 12 midnight, i.e. a new day,
and events start to be written to
/flume/events/2013-01-18/flume_XXXXXXXX.tmp

Will the 2013-01-17 file never be rolled over, unless I have something
like hdfs.idleTimeout=60?
If so, how do Flume sinks keep track of the files they need to roll over
after idleTimeout?

In short, what is the exact use of the idleTimeout parameter?
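For reference, a minimal sketch of the kind of configuration being asked about; the agent, channel, and interceptor names are assumptions, and only the properties relevant to this thread are shown:

a1.sources.r1.interceptors = i1
# the timestamp interceptor supplies the header that the %y-%m-%d escapes need
a1.sources.r1.interceptors.i1.type = timestamp

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
# roll (close and rename) the open file 60 seconds after it was opened
a1.sinks.k1.hdfs.rollInterval = 60
# close the file and drop its writer 60 seconds after the last event for that path
a1.sinks.k1.hdfs.idleTimeout = 60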

Re: hdfs.idleTimeout, what's it used for?

Posted by Mohit Anchlia <mo...@gmail.com>.
I have been using it, and it's a great feature to have.

One question I have, though: what happens when Flume dies unexpectedly?
Does it leave .tmp files behind? How do I clean those up and close them
gracefully?

On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> It's also useful if you want files to get promptly closed and renamed from
> the .tmp or whatever.
>
> We use it with something like a 30-second setting (we have a constant stream
> of data) and hourly bucketing.
>
> There is also the issue that files closed by rollInterval are never
> removed from the internal linked list, so it actually causes a small memory
> leak (which can get big in the long term if you have a lot of files and
> hourly renames). I believe this is what is causing the OOM Mohit is getting
> in FLUME-1850.
>
> So I personally would recommend using it (with a setting that will close
> files before rollInterval does).
>
> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>
>> Ah I see. Again something useful to have in the flume user guide.
>>
>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com>
>> wrote:
>>
>>> the rollInterval will still cause the last 01-17 file to be closed
>>> eventually. The way the HDFS sink works with the different files is that
>>> each unique path is handled by a different BucketWriter object. The sink
>>> can hold as many of these objects as specified by hdfs.maxOpenFiles
>>> (default: 5000), and BucketWriters are only removed when you create the
>>> 5001st writer (5001st unique path). However, generally once a writer is
>>> closed it is never used again (all of your 01-17 writers will never be
>>> used again). To avoid keeping them in the sink's internal list of writers,
>>> the idleTimeout is a specified number of seconds in which no data is
>>> received by the BucketWriter. After this time, the writer will try to
>>> close itself and will then tell the sink to remove it, thus freeing up
>>> everything used by the BucketWriter.
>>>
>>> So the idleTimeout is just a setting to help limit memory usage by the
>>> HDFS sink. The ideal time for it is longer than the maximum time between
>>> events (capped at the rollInterval) - if you know you'll receive a
>>> constant stream of events you might just set it to a minute or something.
>>> Or if you are fine with having multiple files open per hour, you can set
>>> it to a lower number, maybe just over the average time between events.
>>> For me, in just testing, I set it >= rollInterval for the cases when no
>>> events are received in a given hour (I'd rather keep the object alive for
>>> an extra hour than create files every 30 minutes or something).
>>>
>>> Hope that was helpful,
>>>
>>> - Connor
>>>
>>>
>
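To make the sizing advice in the quoted explanation concrete, here is a hedged sketch with idleTimeout set longer than the expected gap between events but shorter than rollInterval, so writers close themselves and get dropped from the sink's writer list before the roll timer fires; the names and numbers are assumptions, not recommendations:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
# time-based roll every 10 minutes
a1.sinks.k1.hdfs.rollInterval = 600
# with a steady stream of events, 2 minutes of silence means the bucket has
# gone quiet; close the file and let the sink evict the BucketWriter
a1.sinks.k1.hdfs.idleTimeout = 120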

Re: hdfs.idleTimeout, what's it used for?

Posted by Connor Woodson <cw...@gmail.com>.
Ya, I read this first; I find it slightly odd that the idleTimeout
implementation doesn't persist through the file closing.



Re: hdfs.idleTimeout, what's it used for?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
I outlined why it was happening in FLUME-1850.

He has hourly rolls, a 4000-second rollInterval and a 900-second idleTimeout.

After an hour, 400 seconds remain on the rollInterval, while the idle timer
(which only starts counting once events stop arriving at the hour boundary)
still has about 900 seconds to go. So the rollInterval gets triggered first,
which triggers close, which cancels all timers including the idleTimeout.
Thus the entry in sfWriters remains. His memory dump confirms this (he has a
huge sfWriters map in memory after 30 days). I also confirmed this behaviour
of rollInterval when developing the idleTimeout feature.

You're right about the limit on the size of sfWriters. With a limit of 5000,
even if the closed ones stay in the list, they shouldn't be that big, since
buffers should be cleaned up.

idleTimeout will indeed result in more files if you don't have a steady
stream of data. It is most useful with a steady stream of time-bucketed data.
In such situations, I might even recommend not using rollInterval at all and
having a short idleTimeout (or, if you're not in a rush to get your file
closed, give it a comfortably long timeout).
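A minimal sketch of the setup recommended here, assuming time-bucketed paths, rollInterval disabled, and a short idleTimeout doing the closing; names and values are assumptions, and size-based rolling is left on as a safety net:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
# 0 disables time-based rolling; the idleTimeout closes the file instead
a1.sinks.k1.hdfs.rollInterval = 0
# close and rename the .tmp file ~30 seconds after the bucket stops receiving events
a1.sinks.k1.hdfs.idleTimeout = 30
# still roll by size so a busy hour doesn't produce one huge file (128 MB here)
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0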

On 01/18/2013 11:19 AM, Connor Woodson wrote:
> Whether idleTimeout is lower or higher than rollInterval is a
> preference. Set it lower, and assume you get one message right on the turn
> of the hour: you will have some part of that hour without any bucket
> writers, but if you get another message at the end of the hour, you will
> end up with two files instead of one. Set idleTimeout to be longer and
> you will get just one file, but also (at worst case) you will have twice as
> many bucketwriters open; so it all depends on how many files you want/how
> much memory you have to spare.
>
> - Connor
>
> An aside:
> bucketwriters, after being closed by rollInterval, aren't really a 
> memory leak; they just are very rarely useful to keep around (your 
> path could rely on hostname, and you could use a rollinterval, and 
> then those bucketwriters will still remain useful). And they will get 
> removed eventually; by default after you've created your 5001st 
> bucketwriter, the first (or whichever was used longest ago) will be 
> removed.
>
> And I don't think that's the cause behind 1850 as he did have an 
> idleTimeout set at 15 minutes.


Re: hdfs.idleTimeout, what's it used for?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
>>
>> @Mohit:
>>
>> When flume dies unexpectedly the .tmp file remains. When it restarts 
>> there is some logic in HDFS sink to recover it (and continue writing 
>> from there). I'm not actually sure of the specifics. You may want to 
>> try and just kill -9 a running flume process on a test machine and 
>> then start it up, look at the logs and see what happens with the output.
>
> Does it also work when there is a long delay before flume gets
> started? We are bucketing by the hour, so if the start occurs in the next
> hour but flume actually died in the previous hour and left a .tmp, does it
> still clean up on restart?

I'm not sure. I think your best bet here is to simulate this on a test
server. Start flume, kill -9 the process after a bit, wait until the
bucket becomes invalid, and restart.

My gut feeling is that it will recover if you have events with a
timestamp belonging to that bucket still incoming (in your persistent
channel, or read in after recovery). If that path doesn't get touched
again, though, it will probably remain as a .tmp file? *This could be
blatantly wrong, so I suggest you test it*

Re: hdfs.idleTimeout, what's it used for?

Posted by Mohit Anchlia <mo...@gmail.com>.

Sent from my iPhone

On Jan 17, 2013, at 6:46 PM, Juhani Connolly <ju...@cyberagent.co.jp> wrote:

> It seemed neater at the time. It's only an issue because rollInterval doesn't remove the entry in sfWriters. We could change it so that close doesn't cancel it, and have it check whether or not the writer is already closed, but that'd be kind of ugly.
> 
> @Mohit:
> 
> When flume dies unexpectedly the .tmp file remains. When it restarts there is some logic in HDFS sink to recover it (and continue writing from there). I'm not actually sure of the specifics. You may want to try and just kill -9 a running flume process on a test machine and then start it up, look at the logs and see what happens with the output.

Does it also work when there is a long delay before flume gets started? We are bucketing by the hour, so if the start occurs in the next hour but flume actually died in the previous hour and left a .tmp, does it still clean up on restart?
> 
> If flume dies cleanly the file is properly closed.
> 
> On 01/18/2013 11:23 AM, Connor Woodson wrote:
>> And @ my aside: I hadn't realized that the idleTimeout is canceled by the rollInterval occurring. That's annoying. So setting a lower idleTimeout, and drastically decreasing maxOpenFiles to at most 2 * possible open files, is probably necessary.
>> 
>> 
>> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <cw...@gmail.com> wrote:
>>> @Mohit:
>>> 
>>> For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir property. The default location is /tmp/hadoop-${user.name}. To change this you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you can specify the property in the core-site.xml of wherever your HADOOP_HOME environment variable points to.
>>> 
>>> - Connor

Re: hdfs.idleTimeout, what's it used for?

Posted by Connor Woodson <cw...@gmail.com>.
Alright, that makes sense. The takeaway from this conversation for everyone
else:

If you use idleTimeout, be sure to set the rollInterval to 0. And if you
don't use idleTimeout, be sure to lower maxOpenFiles to a number relative
to your expected throughput. To use the least memory you will want to use
idleTimeout, but the result will be that more files are created in HDFS.

- Connor
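A hedged sketch of the second option above (no idleTimeout, smaller writer cache); the names and values are assumptions based on the advice in this thread:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H
# hourly time-based roll; idleTimeout is left at its default of 0 (disabled)
a1.sinks.k1.hdfs.rollInterval = 3600
# cap the writer cache at roughly the number of buckets expected to be live
# at once; the least recently used writer is evicted past this limit
a1.sinks.k1.hdfs.maxOpenFiles = 50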



Re: hdfs.idleTimeout, what's it used for?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
That breaks the use case idleTimeout was originally made for: making
sure the file is closed promptly after data stops arriving. We use this
to make sure the files are ready for our batches, which run quite soon
after. The time that rollInterval will trigger is unpredictable, as it
will reset every time any other type of roll is triggered (event count
or size).

By making rollInterval behave properly, all of this is a non-issue. My
recommendation to users would be not to use rollInterval if they're
bucketing by time (it's redundant behavior).

Documentation could definitely be improved. Once we sort out the
approach we want to take I can write it up to make the difference and
usage clearer.



Re: hdfs.idleTimeout, what's it used for?

Posted by Connor Woodson <cw...@gmail.com>.
The way idleTimeout works right now is that it's another rollInterval; it
will work best when rollInterval is not set, and so it seems that its use
is best for when you don't want to use a rollInterval and just want to have
your bucketwriters close when no events are coming through (caused by a path
change or something else; and you can still roll reliably with either count
or size).

As such, perhaps it would be clearer if idleTimeout were renamed to idleRoll
or such?

And then change idleTimeout to only count the seconds since the writer was
closed; if a bucketwriter has been closed for long enough it will
automatically remove itself. This type of idle would then work well with
rollInterval, while the current one doesn't (idleRoll + rollInterval creates
two time-based rollers. There are certainly times for that, but not all of
the time).

- Connor



Re: hdfs.idleTimeout, what's it used for?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
It seemed neater at the time. It's only an issue because rollInterval
doesn't remove the entry in sfWriters. We could change it so that close
doesn't cancel it, and have it check whether or not the writer is
already closed, but that'd be kind of ugly.

@Mohit:

When flume dies unexpectedly the .tmp file remains. When it restarts
there is some logic in the HDFS sink to recover it (and continue writing
from there). I'm not actually sure of the specifics. You may want to try
just kill -9'ing a running flume process on a test machine and then
starting it up; look at the logs and see what happens with the output.

If flume dies cleanly the file is properly closed.

On 01/18/2013 11:23 AM, Connor Woodson wrote:
> And @ my aside: I hadn't realized that the idleTimeout is canceled by 
> the rollInterval occurring. That's annoying. So setting a lower 
> idleTimeout, and drastically decreasing maxOpenFiles to at most 2 * 
> possible open files, is probably necessary.
>
>
> On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson 
> <cwoodson.dev@gmail.com <ma...@gmail.com>> wrote:
>
>     @Mohit:
>
>     For the HDFS Sink, the tmp files are placed based on the
>     hadoop.tmp.dir property. The default location is
>     /tmp/hadoop-${user.name <http://user.name>} To change this you can
>     add -Dhadoop.tmp.dir=<path> to your Flume command line call, or
>     you can specify the property in the core-site.xml of wherever your
>     HADOOP_HOME environment variable points to.
>
>     - Connor
>
>
>     On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson
>     <cwoodson.dev@gmail.com <ma...@gmail.com>> wrote:
>
>         Whether idleTimeout is lower or higher than rollInterval is a
>         preference; set it before, and assume you get one message
>         right on the turn of the hour, then you will have some part of
>         that hour without any bucket writers; but if you get another
>         message at the end of the hour, you will end up with two files
>         instead of one. Set it idleTimeout to be longer and you will
>         get just one file, but also (at worst case) you will have
>         twice as many bucketwriters open; so it all depends on how
>         many files you want/how much memory you have to spare.
>
>         - Connor
>
>         An aside:
>         bucketwriters, after being closed by rollInterval, aren't
>         really a memory leak; they just are very rarely useful to keep
>         around (your path could rely on hostname, and you could use a
>         rollinterval, and then those bucketwriters will still remain
>         useful). And they will get removed eventually; by default
>         after you've created your 5001st bucketwriter, the first (or
>         whichever was used longest ago) will be removed.
>
>         And I don't think that's the cause behind 1850 as he did have
>         an idleTimeout set at 15 minutes.
>
>
>         On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly
>         <juhani_connolly@cyberagent.co.jp
>         <ma...@cyberagent.co.jp>> wrote:
>
>             It's also useful if you want files to get promptly closed
>             and renamed from the .tmp or whatever.
>
>             We use it with something like 30seconds setting(we have a
>             constant stream of data) and hourly bucketing.
>
>             There is also the issue that files closed by rollInterval
>             are never removed from the internal linkedList so it
>             actually causes a small memory leak(which can get big in
>             the long term if you have a lot of files and hourly
>             renames). I believe this is what is causing the OOM Mohit
>             is getting in FLUME-1850
>
>             So I personally would recommend using it(with a setting
>             that will close files before rollInterval does).
>
>
>             On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>
>                 Ah I see. Again something useful to have in the flume
>                 user guide.
>
>                 On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson
>                 <cwoodson.dev@gmail.com
>                 <ma...@gmail.com>> wrote:
>
>                     the rollInterval will still cause the last 01-17
>                     file to be closed
>                     eventually. The way the HDFS sink works with the
>                     different files is each
>                     unique path is specified by a different
>                     BucketWriter object. The sink can
>                     hold as many objects as specified by
>                     hdfs.maxOpenWorkers (default: 5000),
>                     and bucketwriters are only removed when you create
>                     the 5001th writer (5001th
>                     unique path). However, generally once a writer is
>                     closed it is never used
>                     again (all of your 1-17 writers will never be used
>                     again). To avoid keeping
>                     them in the sink's internal list of writers, the
>                     idleTimeout is a specified
>                     number of seconds in which no data is received by
>                     the BucketWriter. After
>                     this time, the writer will try to close itself and
>                     will then tell the sink
>                     to remove it, thus freeing up everything used by
>                     the bucketwriter.
>
>                     So the idleTimeout is just a setting to help limit
>                     memory usage by the hdfs
>                     sink. The ideal time for it is longer than the
>                     maximum time between events
>                     (capped at the rollInterval) - if you know you'll
>                     receive a constant stream
>                     of events you might just set it to a minute or
>                     something. Or if you are fine
>                     with having multiple files open per hour, you can
>                     set it to a lower number;
>                     maybe just over the average time between events.
>                     For me in just testing, I
>                     set it >= rollInterval for the cases when no
>                     events are received in a given
>                     hour (I'd rather keep the object alive for an
>                     extra hour than create files
>                     every 30 minutes or something).
>
>                     Hope that was helpful,
>
>                     - Connor
>
>
>                     On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V.
>                     Karambelkar
>                     <bhaskarvk@gmail.com> wrote:
>
>                         Say If I have
>
>                         a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>
>                         hdfs.rollInterval=60
>
>                         Now, if there is a file
>                         /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>                         This file is not ready to be rolled over yet,
>                         i.e. 60 seconds are not
>                         up and now it's past 12 midnight, i.e. new day
>                         And events start to be written to
>                         /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>
>                         will the file 2013-01-17 never be rolled over,
>                         unless I have something
>                         like hdfs.idleTimeout=60  ?
>                         If so how do flume sinks keep track of files
>                         they need to rollover
>                         after idleTimeout ?
>
>                         In short what's the exact use of idleTimeout
>                         parameter ?
>
>
>
>
>
>


Re: hdfs.idleTimeout ,what's it used for ?

Posted by Connor Woodson <cw...@gmail.com>.
And @ my aside: I hadn't realized that the idleTimeout is canceled when the
rollInterval fires. That's annoying. So setting a lower idleTimeout, and
drastically decreasing maxOpenFiles to at most 2 * the number of possible
open files, is probably necessary.
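
A minimal sketch of what that might look like for daily bucketing (the sink
name and numbers here are placeholders, not tested recommendations):

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
a1.sinks.k1.hdfs.rollInterval = 3600
# close an idle bucket well before rollInterval would
a1.sinks.k1.hdfs.idleTimeout = 300
# with roughly one active bucket at a time, a small writer cache is enough
a1.sinks.k1.hdfs.maxOpenFiles = 2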


On Thu, Jan 17, 2013 at 6:20 PM, Connor Woodson <cw...@gmail.com>wrote:

> @Mohit:
>
> For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
> property. The default location is /tmp/hadoop-${user.name}. To change this
> you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you
> can specify the property in the core-site.xml of wherever your HADOOP_HOME
> environment variable points to.
>
> - Connor
>
>
> On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cw...@gmail.com>wrote:
>
>> Whether idleTimeout is lower or higher than rollInterval is a preference;
>> set it before, and assume you get one message right on the turn of the
>> hour, then you will have some part of that hour without any bucket writers;
>> but if you get another message at the end of the hour, you will end up with
>> two files instead of one. Set idleTimeout to be longer and you will get
>> just one file, but also (at worst case) you will have twice as many
>> bucketwriters open; so it all depends on how many files you want/how much
>> memory you have to spare.
>>
>> - Connor
>>
>> An aside:
>> bucketwriters, after being closed by rollInterval, aren't really a memory
>> leak; they just are very rarely useful to keep around (your path could rely
>> on hostname, and you could use a rollinterval, and then those bucketwriters
>> will still remain useful). And they will get removed eventually; by default
>> after you've created your 5001st bucketwriter, the first (or whichever was
>> used longest ago) will be removed.
>>
>> And I don't think that's the cause behind 1850 as he did have an
>> idleTimeout set at 15 minutes.
>>
>>
>> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
>> juhani_connolly@cyberagent.co.jp> wrote:
>>
>>> It's also useful if you want files to get promptly closed and renamed
>>> from the .tmp or whatever.
>>>
>>> We use it with something like 30seconds setting(we have a constant
>>> stream of data) and hourly bucketing.
>>>
>>> There is also the issue that files closed by rollInterval are never
>>> removed from the internal linkedList so it actually causes a small memory
>>> leak(which can get big in the long term if you have a lot of files and
>>> hourly renames). I believe this is what is causing the OOM Mohit is getting
>>> in FLUME-1850
>>>
>>> So I personally would recommend using it(with a setting that will close
>>> files before rollInterval does).
>>>
>>>
>>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>>
>>>> Ah I see. Again something useful to have in the flume user guide.
>>>>
>>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com>
>>>> wrote:
>>>>
>>>>> the rollInterval will still cause the last 01-17 file to be closed
>>>>> eventually. The way the HDFS sink works with the different files is
>>>>> each
>>>>> unique path is specified by a different BucketWriter object. The sink
>>>>> can
>>>>> hold as many objects as specified by hdfs.maxOpenWorkers (default:
>>>>> 5000),
>>>>> and bucketwriters are only removed when you create the 5001st writer
>>>>> (5001st
>>>>> unique path). However, generally once a writer is closed it is never
>>>>> used
>>>>> again (all of your 1-17 writers will never be used again). To avoid
>>>>> keeping
>>>>> them in the sink's internal list of writers, the idleTimeout is a
>>>>> specified
>>>>> number of seconds in which no data is received by the BucketWriter.
>>>>> After
>>>>> this time, the writer will try to close itself and will then tell the
>>>>> sink
>>>>> to remove it, thus freeing up everything used by the bucketwriter.
>>>>>
>>>>> So the idleTimeout is just a setting to help limit memory usage by the
>>>>> hdfs
>>>>> sink. The ideal time for it is longer than the maximum time between
>>>>> events
>>>>> (capped at the rollInterval) - if you know you'll receive a constant
>>>>> stream
>>>>> of events you might just set it to a minute or something. Or if you
>>>>> are fine
>>>>> with having multiple files open per hour, you can set it to a lower
>>>>> number;
>>>>> maybe just over the average time between events. For me in just
>>>>> testing, I
>>>>> set it >= rollInterval for the cases when no events are received in a
>>>>> given
>>>>> hour (I'd rather keep the object alive for an extra hour than create
>>>>> files
>>>>> every 30 minutes or something).
>>>>>
>>>>> Hope that was helpful,
>>>>>
>>>>> - Connor
>>>>>
>>>>>
>>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>>>> <bh...@gmail.com> wrote:
>>>>>
>>>>>> Say If I have
>>>>>>
>>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>>
>>>>>> hdfs.rollInterval=60
>>>>>>
>>>>>> Now, if there is a file
>>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>>>>> up and now it's past 12 midnight, i.e. new day
>>>>>> And events start to be written to
>>>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>>>
>>>>>> will the file 2013-01-17 never be rolled over, unless I have something
>>>>>> like hdfs.idleTimeout=60  ?
>>>>>> If so how do flume sinks keep track of files they need to rollover
>>>>>> after idleTimeout ?
>>>>>>
>>>>>> In short what's the exact use of idleTimeout parameter ?
>>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: hdfs.idleTimeout ,what's it used for ?

Posted by Connor Woodson <cw...@gmail.com>.
@Mohit:

For the HDFS Sink, the tmp files are placed based on the hadoop.tmp.dir
property. The default location is /tmp/hadoop-${user.name}. To change this
you can add -Dhadoop.tmp.dir=<path> to your Flume command line call, or you
can specify the property in the core-site.xml of wherever your HADOOP_HOME
environment variable points to.
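
For example (the agent name, file names, and paths below are placeholders,
not values from this thread), either of these should work:

bin/flume-ng agent --conf conf --conf-file flume.conf --name a1 -Dhadoop.tmp.dir=/data/flume-tmp

or, in core-site.xml (e.g. under $HADOOP_HOME/conf):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/flume-tmp</value>
</property>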

- Connor


On Thu, Jan 17, 2013 at 6:19 PM, Connor Woodson <cw...@gmail.com>wrote:

> Whether idleTimeout is lower or higher than rollInterval is a preference;
> set it before, and assume you get one message right on the turn of the
> hour, then you will have some part of that hour without any bucket writers;
> but if you get another message at the end of the hour, you will end up with
> two files instead of one. Set idleTimeout to be longer and you will get
> just one file, but also (at worst case) you will have twice as many
> bucketwriters open; so it all depends on how many files you want/how much
> memory you have to spare.
>
> - Connor
>
> An aside:
> bucketwriters, after being closed by rollInterval, aren't really a memory
> leak; they just are very rarely useful to keep around (your path could rely
> on hostname, and you could use a rollinterval, and then those bucketwriters
> will still remain useful). And they will get removed eventually; by default
> after you've created your 5001st bucketwriter, the first (or whichever was
> used longest ago) will be removed.
>
> And I don't think that's the cause behind 1850 as he did have an
> idleTimeout set at 15 minutes.
>
>
> On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
> juhani_connolly@cyberagent.co.jp> wrote:
>
>> It's also useful if you want files to get promptly closed and renamed
>> from the .tmp or whatever.
>>
>> We use it with something like 30seconds setting(we have a constant stream
>> of data) and hourly bucketing.
>>
>> There is also the issue that files closed by rollInterval are never
>> removed from the internal linkedList so it actually causes a small memory
>> leak(which can get big in the long term if you have a lot of files and
>> hourly renames). I believe this is what is causing the OOM Mohit is getting
>> in FLUME-1850
>>
>> So I personally would recommend using it(with a setting that will close
>> files before rollInterval does).
>>
>>
>> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>>
>>> Ah I see. Again something useful to have in the flume user guide.
>>>
>>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com>
>>> wrote:
>>>
>>>> the rollInterval will still cause the last 01-17 file to be closed
>>>> eventually. The way the HDFS sink works with the different files is each
>>>> unique path is specified by a different BucketWriter object. The sink
>>>> can
>>>> hold as many objects as specified by hdfs.maxOpenWorkers (default:
>>>> 5000),
>>>> and bucketwriters are only removed when you create the 5001st writer
>>>> (5001st
>>>> unique path). However, generally once a writer is closed it is never
>>>> used
>>>> again (all of your 1-17 writers will never be used again). To avoid
>>>> keeping
>>>> them in the sink's internal list of writers, the idleTimeout is a
>>>> specified
>>>> number of seconds in which no data is received by the BucketWriter.
>>>> After
>>>> this time, the writer will try to close itself and will then tell the
>>>> sink
>>>> to remove it, thus freeing up everything used by the bucketwriter.
>>>>
>>>> So the idleTimeout is just a setting to help limit memory usage by the
>>>> hdfs
>>>> sink. The ideal time for it is longer than the maximum time between
>>>> events
>>>> (capped at the rollInterval) - if you know you'll receive a constant
>>>> stream
>>>> of events you might just set it to a minute or something. Or if you are
>>>> fine
>>>> with having multiple files open per hour, you can set it to a lower
>>>> number;
>>>> maybe just over the average time between events. For me in just
>>>> testing, I
>>>> set it >= rollInterval for the cases when no events are received in a
>>>> given
>>>> hour (I'd rather keep the object alive for an extra hour than create
>>>> files
>>>> every 30 minutes or something).
>>>>
>>>> Hope that was helpful,
>>>>
>>>> - Connor
>>>>
>>>>
>>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>>> <bh...@gmail.com> wrote:
>>>>
>>>>> Say If I have
>>>>>
>>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>>
>>>>> hdfs.rollInterval=60
>>>>>
>>>>> Now, if there is a file
>>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>>>> up and now it's past 12 midnight, i.e. new day
>>>>> And events start to be written to
>>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>>
>>>>> will the file 2013-01-17 never be rolled over, unless I have something
>>>>> like hdfs.idleTimeout=60  ?
>>>>> If so how do flume sinks keep track of files they need to rollover
>>>>> after idleTimeout ?
>>>>>
>>>>> In short what's the exact use of idleTimeout parameter ?
>>>>>
>>>>
>>>>
>>
>

Re: hdfs.idleTimeout ,what's it used for ?

Posted by Connor Woodson <cw...@gmail.com>.
Whether idleTimeout is lower or higher than rollInterval is a preference;
set it lower and, assuming you get one message right at the turn of the
hour, you will have some part of that hour without any bucket writers; but
if you get another message at the end of the hour, you will end up with
two files instead of one. Set idleTimeout to be longer and you will get
just one file, but also (in the worst case) you will have twice as many
bucketwriters open; so it all depends on how many files you want/how much
memory you have to spare.
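
A hedged sketch of the two options against hourly buckets (the sink name
and exact numbers are placeholders, not recommendations):

# option A: idleTimeout > rollInterval -- at most one file per bucket, but
# an idle writer can stay cached for up to roughly two hours
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.idleTimeout = 3700

# option B: idleTimeout < rollInterval -- writers are freed quickly, but a
# bucket with a gap in traffic can be split across two files
# a1.sinks.k1.hdfs.idleTimeout = 300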

- Connor

An aside:
bucketwriters, after being closed by rollInterval, aren't really a memory
leak; they just are very rarely useful to keep around (your path could rely
on hostname, and you could use a rollinterval, and then those bucketwriters
will still remain useful). And they will get removed eventually; by default
after you've created your 5001st bucketwriter, the first (or whichever was
used longest ago) will be removed.

And I don't think that's the cause behind 1850 as he did have an
idleTimeout set at 15 minutes.


On Thu, Jan 17, 2013 at 6:08 PM, Juhani Connolly <
juhani_connolly@cyberagent.co.jp> wrote:

> It's also useful if you want files to get promptly closed and renamed from
> the .tmp or whatever.
>
> We use it with something like 30seconds setting(we have a constant stream
> of data) and hourly bucketing.
>
> There is also the issue that files closed by rollInterval are never
> removed from the internal linkedList so it actually causes a small memory
> leak(which can get big in the long term if you have a lot of files and
> hourly renames). I believe this is what is causing the OOM Mohit is getting
> in FLUME-1850
>
> So I personally would recommend using it(with a setting that will close
> files before rollInterval does).
>
>
> On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
>
>> Ah I see. Again something useful to have in the flume user guide.
>>
>> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com>
>> wrote:
>>
>>> the rollInterval will still cause the last 01-17 file to be closed
>>> eventually. The way the HDFS sink works with the different files is each
>>> unique path is specified by a different BucketWriter object. The sink can
>>> hold as many objects as specified by hdfs.maxOpenWorkers (default: 5000),
>>> and bucketwriters are only removed when you create the 5001st writer
>>> (5001st
>>> unique path). However, generally once a writer is closed it is never used
>>> again (all of your 1-17 writers will never be used again). To avoid
>>> keeping
>>> them in the sink's internal list of writers, the idleTimeout is a
>>> specified
>>> number of seconds in which no data is received by the BucketWriter. After
>>> this time, the writer will try to close itself and will then tell the
>>> sink
>>> to remove it, thus freeing up everything used by the bucketwriter.
>>>
>>> So the idleTimeout is just a setting to help limit memory usage by the
>>> hdfs
>>> sink. The ideal time for it is longer than the maximum time between
>>> events
>>> (capped at the rollInterval) - if you know you'll receive a constant
>>> stream
>>> of events you might just set it to a minute or something. Or if you are
>>> fine
>>> with having multiple files open per hour, you can set it to a lower
>>> number;
>>> maybe just over the average time between events. For me in just testing,
>>> I
>>> set it >= rollInterval for the cases when no events are received in a
>>> given
>>> hour (I'd rather keep the object alive for an extra hour than create
>>> files
>>> every 30 minutes or something).
>>>
>>> Hope that was helpful,
>>>
>>> - Connor
>>>
>>>
>>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>>> <bh...@gmail.com> wrote:
>>>
>>>> Say If I have
>>>>
>>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>>
>>>> hdfs.rollInterval=60
>>>>
>>>> Now, if there is a file
>>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>>> up and now it's past 12 midnight, i.e. new day
>>>> And events start to be written to
>>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>>
>>>> will the file 2013-01-17 never be rolled over, unless I have something
>>>> like hdfs.idleTimeout=60  ?
>>>> If so how do flume sinks keep track of files they need to rollover
>>>> after idleTimeout ?
>>>>
>>>> In short what's the exact use of idleTimeout parameter ?
>>>>
>>>
>>>
>

Re: hdfs.idleTimeout ,what's it used for ?

Posted by Juhani Connolly <ju...@cyberagent.co.jp>.
It's also useful if you want files to get promptly closed and renamed 
from the .tmp or whatever.

We use it with something like a 30 second setting (we have a constant
stream of data) and hourly bucketing.
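
As a rough sketch of that kind of setup (the sink name and path are
placeholders here, not our actual config):

a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d/%H
a1.sinks.k1.hdfs.rollInterval = 3600
# with a constant stream of events, 30 idle seconds effectively only occur
# after the hour rolls over, so the previous .tmp file is closed and renamed
a1.sinks.k1.hdfs.idleTimeout = 30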

There is also the issue that files closed by rollInterval are never 
removed from the internal linkedList so it actually causes a small 
memory leak (which can get big in the long term if you have a lot of
files and hourly renames). I believe this is what is causing the OOM 
Mohit is getting in FLUME-1850

So I personally would recommend using it (with a setting that will close
files before rollInterval does).

On 01/18/2013 06:38 AM, Bhaskar V. Karambelkar wrote:
> Ah I see. Again something useful to have in the flume user guide.
>
> On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com> wrote:
>> the rollInterval will still cause the last 01-17 file to be closed
>> eventually. The way the HDFS sink works with the different files is each
>> unique path is specified by a different BucketWriter object. The sink can
>> hold as many objects as specified by hdfs.maxOpenWorkers (default: 5000),
>> and bucketwriters are only removed when you create the 5001st writer (5001st
>> unique path). However, generally once a writer is closed it is never used
>> again (all of your 1-17 writers will never be used again). To avoid keeping
>> them in the sink's internal list of writers, the idleTimeout is a specified
>> number of seconds in which no data is received by the BucketWriter. After
>> this time, the writer will try to close itself and will then tell the sink
>> to remove it, thus freeing up everything used by the bucketwriter.
>>
>> So the idleTimeout is just a setting to help limit memory usage by the hdfs
>> sink. The ideal time for it is longer than the maximum time between events
>> (capped at the rollInterval) - if you know you'll receive a constant stream
>> of events you might just set it to a minute or something. Or if you are fine
>> with having multiple files open per hour, you can set it to a lower number;
>> maybe just over the average time between events. For me in just testing, I
>> set it >= rollInterval for the cases when no events are received in a given
>> hour (I'd rather keep the object alive for an extra hour than create files
>> every 30 minutes or something).
>>
>> Hope that was helpful,
>>
>> - Connor
>>
>>
>> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
>> <bh...@gmail.com> wrote:
>>> Say If I have
>>>
>>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>>
>>> hdfs.rollInterval=60
>>>
>>> Now, if there is a file
>>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>>> up and now it's past 12 midnight, i.e. new day
>>> And events start to be written to
>>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>>
>>> will the file 2013-01-17 never be rolled over, unless I have something
>>> like hdfs.idleTimeout=60  ?
>>> If so how do flume sinks keep track of files they need to rollover
>>> after idleTimeout ?
>>>
>>> In short what's the exact use of idleTimeout parameter ?
>>


Re: hdfs.idleTimeout ,what's it used for ?

Posted by "Bhaskar V. Karambelkar" <bh...@gmail.com>.
Ah I see. Again something useful to have in the flume user guide.

On Thu, Jan 17, 2013 at 3:29 PM, Connor Woodson <cw...@gmail.com> wrote:
> the rollInterval will still cause the last 01-17 file to be closed
> eventually. The way the HDFS sink works with the different files is each
> unique path is specified by a different BucketWriter object. The sink can
> hold as many objects as specified by hdfs.maxOpenWorkers (default: 5000),
> and bucketwriters are only removed when you create the 5001st writer (5001st
> unique path). However, generally once a writer is closed it is never used
> again (all of your 1-17 writers will never be used again). To avoid keeping
> them in the sink's internal list of writers, the idleTimeout is a specified
> number of seconds in which no data is received by the BucketWriter. After
> this time, the writer will try to close itself and will then tell the sink
> to remove it, thus freeing up everything used by the bucketwriter.
>
> So the idleTimeout is just a setting to help limit memory usage by the hdfs
> sink. The ideal time for it is longer than the maximum time between events
> (capped at the rollInterval) - if you know you'll receive a constant stream
> of events you might just set it to a minute or something. Or if you are fine
> with having multiple files open per hour, you can set it to a lower number;
> maybe just over the average time between events. For me in just testing, I
> set it >= rollInterval for the cases when no events are received in a given
> hour (I'd rather keep the object alive for an extra hour than create files
> every 30 minutes or something).
>
> Hope that was helpful,
>
> - Connor
>
>
> On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar
> <bh...@gmail.com> wrote:
>>
>> Say If I have
>>
>> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>>
>> hdfs.rollInterval=60
>>
>> Now, if there is a file
>> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
>> This file is not ready to be rolled over yet, i.e. 60 seconds are not
>> up and now it's past 12 midnight, i.e. new day
>> And events start to be written to
>> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>>
>> will the file 2013-01-17 never be rolled over, unless I have something
>> like hdfs.idleTimeout=60  ?
>> If so how do flume sinks keep track of files they need to rollover
>> after idleTimeout ?
>>
>> In short what's the exact use of idleTimeout parameter ?
>
>

Re: hdfs.idleTimeout ,what's it used for ?

Posted by Connor Woodson <cw...@gmail.com>.
the rollInterval will still cause the last 01-17 file to be closed
eventually. The way the HDFS sink works with the different files is each
unique path is specified by a different BucketWriter object. The sink can
hold as many objects as specified by hdfs.maxOpenWorkers (default: 5000),
and bucketwriters are only removed when you create the 5001st writer
(5001st unique path). However, generally once a writer is closed it is
never used again (all of your 01-17 writers will never be used again). To
avoid keeping them in the sink's internal list of writers, the idleTimeout
is a specified number of seconds in which no data is received by the
BucketWriter. After this time, the writer will try to close itself and will
then tell the sink to remove it, thus freeing up everything used by the
bucketwriter.

So the idleTimeout is just a setting to help limit memory usage by the hdfs
sink. The ideal time for it is longer than the maximum time between events
(capped at the rollInterval) - if you know you'll receive a constant stream
of events you might just set it to a minute or something. Or if you are
fine with having multiple files open per hour, you can set it to a lower
number; maybe just over the average time between events. For me in just
testing, I set it >= rollInterval for the cases when no events are received
in a given hour (I'd rather keep the object alive for an extra hour than
create files every 30 minutes or something).
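
For the daily-bucketed example that started this thread, a hedged sketch
(values illustrative only) would be:

a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
a1.sinks.k1.hdfs.rollInterval = 60
# also close a bucket after 60 seconds without events, so the 01-17 file is
# cleaned up shortly after midnight once writes move to the 01-18 path
a1.sinks.k1.hdfs.idleTimeout = 60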

Hope that was helpful,

- Connor


On Thu, Jan 17, 2013 at 12:07 PM, Bhaskar V. Karambelkar <
bhaskarvk@gmail.com> wrote:

> Say If I have
>
> a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
>
> hdfs.rollInterval=60
>
> Now, if there is a file
> /flume/events/2013-01-17/flume_XXXXXXXXX.tmp
> This file is not ready to be rolled over yet, i.e. 60 seconds are not
> up and now it's past 12 midnight, i.e. new day
> And events start to be written to
> /flume/events/2013-01-18/flume_XXXXXXXX.tmp
>
> will the file 2013-01-17 never be rolled over, unless I have something
> like hdfs.idleTimeout=60  ?
> If so how do flume sinks keep track of files they need to rollover
> after idleTimeout ?
>
> In short what's the exact use of idleTimeout parameter ?
>