Posted to user@flume.apache.org by Tim Driscoll <ti...@gmail.com> on 2013/05/21 18:59:19 UTC

HDFSEventSink Memory Leak Workarounds

Hello,

We have a Flume agent (version 1.3.1) set up with the HDFSEventSink.  We
noticed that we were running out of memory after a few days of running, and
we believe we have pinpointed it to an issue with the hdfs.idleTimeout
setting.  I believe this is fixed in 1.4 per FLUME-1864.

Our planned workaround was simply to remove the idleTimeout setting, which
worked, but it brought up another issue.  Since we partition our data by
timestamp, at midnight we roll over to a new bucket/partition, open new
bucket writers, and leave the previous bucket writers open.  Ideally
idleTimeout would clean those up.  So instead of a slow, steady leak, we're
now seeing roughly a 100MB leak every day.

Short of upgrading Flume, does anyone know of a configuration workaround
for this?  For now we've just bumped up the heap size, and I'm having to
restart our agents every few days, which obviously isn't ideal.

Is anyone else seeing issues like this?  How do others use the HDFS sink
to continuously write large volumes of logs from multiple source hosts?  I
can go into more depth about our setup/environment if necessary.

Here's a snippet of one of our 4 HDFS sink configs:
agent.sinks.rest-xaction-hdfs-sink.type = hdfs
agent.sinks.rest-xaction-hdfs-sink.channel = rest-xaction-chan
agent.sinks.rest-xaction-hdfs-sink.hdfs.path = /user/svc-neb/rest_xaction_logs/date=%Y-%m-%d
agent.sinks.rest-xaction-hdfs-sink.hdfs.rollCount = 0
agent.sinks.rest-xaction-hdfs-sink.hdfs.rollSize = 0
agent.sinks.rest-xaction-hdfs-sink.hdfs.rollInterval = 3600
agent.sinks.rest-xaction-hdfs-sink.hdfs.idleTimeout = 300
agent.sinks.rest-xaction-hdfs-sink.hdfs.batchSize = 1000
agent.sinks.rest-xaction-hdfs-sink.hdfs.filePrefix = %{host}
agent.sinks.rest-xaction-hdfs-sink.hdfs.fileSuffix = .avro
agent.sinks.rest-xaction-hdfs-sink.hdfs.fileType = DataStream
agent.sinks.rest-xaction-hdfs-sink.serializer = avro_event
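
To make the bucketing concrete: the %Y-%m-%d escape in hdfs.path gives
each day its own directory, and filePrefix = %{host} gives each source
host its own file within it, so the output ends up looking roughly like
the following (the dates, host name, and counter are made up for
illustration):

/user/svc-neb/rest_xaction_logs/date=2013-05-20/apphost01.1369063159000.avro
/user/svc-neb/rest_xaction_logs/date=2013-05-21/apphost01.1369153159000.avro

At midnight the sink starts writing under the new date= directory with
fresh bucket writers, while the writers for the previous day's directory
stay behind in the sink's internal writer map; that lingering state is the
daily ~100MB growth described above.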

-Tim

Re: HDFSEventSink Memory Leak Workarounds

Posted by Tim Driscoll <ti...@gmail.com>.
That sounds like the expected behavior to me based on the message, though
it's a little confusing that it surfaces as an IOException.

Somewhat related: we probably had our idleTimeout set too low, so files
were being closed pretty often.  From what I can tell, that was what caused
the memory leak for us, due to FLUME-1864.  So it may be a good idea to
bump up the idleTimeout if you're constantly closing idle files.  I could
be wrong, though; I'd defer to the developers. :)
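
For example, with hourly buckets something like the following seems safer
than a value of a few seconds (the sink name and number are placeholders,
not a tested recommendation):

agent.sinks.your-hdfs-sink.hdfs.idleTimeout = 900

i.e. long enough that a writer isn't repeatedly closed and reopened while
its bucket is still receiving events.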



RE: HDFSEventSink Memory Leak Workarounds

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
This thread reminded me to check my configs, since I use a low idleTimeout and bucket events by hour. It turned out I still had the default rollInterval set, so I disabled it and updated my configs.
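
(For reference, the relevant part of my sink config now looks roughly like this; the sink name and the exact idleTimeout value below are placeholders:)

agent.sinks.weblog-hdfs-sink.hdfs.path = /flume/WebLogs/datekey=%Y%m%d/hour=%H
agent.sinks.weblog-hdfs-sink.hdfs.rollInterval = 0
agent.sinks.weblog-hdfs-sink.hdfs.idleTimeout = 60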

Now I see a lot of exceptions logged as warnings in the log immediately following an idleTimeout:

8:55:40.663 AM INFO org.apache.flume.sink.hdfs.BucketWriter
Closing idle bucketWriter /flume/WebLogs/datekey=20130522/hour=08/FlumeData.1369238128886.tmp
8:55:40.675 AM INFO org.apache.flume.sink.hdfs.BucketWriter
Renaming /flume/WebLogs/datekey=20130522/hour=08/FlumeData.1369238128886.tmp to /flume/WebLogs/datekey=20130522/hour=08/FlumeData.1369238128886
8:55:40.677 AM WARN org.apache.flume.sink.hdfs.HDFSEventSink
HDFS IO error
java.io.IOException: This bucket writer was closed due to idling and this handle is thus no longer valid
 at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:391)
 at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:729)
 at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:727)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

Given that these are logged at WARN, I have been assuming they are benign errors. Is that assumption correct?

thanks,
Paul Chavez


Re: HDFSEventSink Memory Leak Workarounds

Posted by Connor Woodson <cw...@gmail.com>.
The other property you will want to look at is maxOpenFiles, which is the
maximum number of files/paths whose writers are held in memory at one time.

If you search for the email thread with the subject "hdfs.idleTimeout ,what's
it used for ?" from back in January, you will find a discussion along these
lines. As a quick summary: if rollInterval is not set to 0, you should avoid
using idleTimeout and should set maxOpenFiles to a reasonable number (the
default is 500, which is too large; I think that default is changed for 1.4).
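
Applied to the config Tim posted, that would mean something along these
lines (the maxOpenFiles value is only an illustration; size it to the
number of buckets you actually write to at once):

agent.sinks.rest-xaction-hdfs-sink.hdfs.rollInterval = 3600
agent.sinks.rest-xaction-hdfs-sink.hdfs.maxOpenFiles = 50

with the hdfs.idleTimeout line removed entirely.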

- Connor

