Posted to user@flume.apache.org by "Mangtani, Kushal" <Ku...@viasat.com> on 2014/08/08 21:39:38 UTC

RE: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Hello Flume Team,

I have recently seen a bug/weird behaviour in the File Channel. I am using the FileChannel in my production environment, so please help me avoid hiccups in prod. Recently, my file channel became full.
So the only ways of fixing this were:

  1.  restart the Flume process.
  2.  tweak the transactionCapacity of the FileChannel.
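
For reference, both knobs mentioned above live in the agent's properties file; a minimal sketch, with made-up agent/channel names and illustrative values (not my actual config):

```properties
# Hypothetical agent "a1" with one file channel "c1"; numbers are examples only.
a1.channels = c1
a1.channels.c1.type = file
# Maximum number of events the channel can hold; "channel full" means this was hit.
a1.channels.c1.capacity = 1000000
# Maximum events per transaction (option 2 above would tweak this).
a1.channels.c1.transactionCapacity = 10000
```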

I went with 1). However, after doing so, my Flume process was stuck and the logs were:


08 Aug 2014 19:03:54,014 INFO  [lifecycleSupervisor-1-4] (org.apache.flume.channel.file.LogFile$SequentialReader.next:597)  - File position exceeds the threshold: 1623195647, position: 1623195649

08 Aug 2014 19:03:54,015 INFO  [lifecycleSupervisor-1-4] (org.apache.flume.channel.file.LogFile$SequentialReader.next:608)  - Encountered EOF at 1623195649 in /usr/lib/flume-ng/datastore/channel1/logs/log-5802


It looks like, for some reason, the file pointer was at a position greater than the file size. Ultimately, I had to delete the logs, checkpoint, and backup-checkpoint for my Flume process to resume processing events.

So the whole purpose of the FileChannel, i.e. better durability at the cost of average performance, was defeated here.


Questions:

  1.  Is there something I could have done to prevent this data loss?
  2.  Also, I believe Flume NG uses a push-pull mechanism, where sources push events to channels and sinks pull events from channels, in contrast to Flume OG (a push-only mechanism). Correct me if I'm wrong. Was there a reason for this push-pull architecture in Flume land?

Thanks,
Kushal Mangtani

________________________________
From: Hari Shreedharan [hshreedharan@cloudera.com]
Sent: Friday, February 28, 2014 11:38 AM
To: user@flume.apache.org
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

It is currently in trunk, so it will be in flume 1.5


Thanks,
Hari


On Friday, February 28, 2014 at 11:30 AM, Mangtani, Kushal wrote:

Hari,



Thanks for the feedback. This was really helpful. I am going to use provisioned IO for a while to make sure the exception does not come back.



Also, from the comments section of the Jira ticket given below, I noticed that you were able to identify the reason for the exception: perhaps old logs are never deleted. Are you going to put a patch into Flume 1.5 so that this exception is resolved?



-Kushal Mangtani



From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Thursday, February 27, 2014 11:19 AM
To: user@flume.apache.org
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



See https://issues.apache.org/jira/browse/FLUME-2307



This jira removed the write-timeout, but that only makes sure that there is no transaction in limbo. The real reason, like I said, is slow IO. Try using provisioned IO for better throughput.





Thanks,

Hari



On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:

Hari,



Thanks for the prompt reply. The current file channel’s write-timeout = 30 sec. The EBS drive’s current capacity = 200 GB. The rate of writes is 60 events/min, where each event is approx. 40 KB.



I am thinking of increasing the file channel write-timeout to 60 sec. What do you suggest?

Also, one strange thing I noticed: all the Flume collectors get the same exception, even though each has a separate EBS drive. Any inputs?



Thanks,

Kushal Mangtani



From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Thursday, February 27, 2014 10:35 AM
To: user@flume.apache.org
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



For now, increase the file channel’s write-timeout parameter to around 30 or so (basically, the file channel is timing out while writing to disk). But the basic problem you are seeing is that your EBS instance is very slow and IO is taking too long. You either need to increase your EBS IO capacity, or reduce the rate of writes.
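
A sketch of the setting in question, with hypothetical agent/channel names (write-timeout is in seconds and applies up to Flume 1.4; FLUME-2307, linked earlier in this thread, later removed it):

```properties
# Hypothetical collector agent; only write-timeout is the setting under discussion.
a1.channels.c2.type = file
# Seconds to wait for the log write lock before failing with
# "Failed to obtain lock for writing to the log".
a1.channels.c2.write-timeout = 30
```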





Thanks,

Hari



On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal wrote:





From: Mangtani, Kushal
Sent: Wednesday, February 26, 2014 4:51 PM
To: 'user@flume.apache.org'; 'user-subscribe@flume.apache.org'
Cc: Rangnekar, Rohit; 'dev@flume.apache.org'
Subject: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



Hi,



I'm using Flume-Ng 1.4 cdh4.4 Tarball for collecting aggregated logs.

I am running a 2-tier (agent, collector) Flume configuration with custom plugins. There are approximately 20 agent machines (receiving data) and 6 collector machines (writing to HDFS), all running independently. However, I have been facing some File Channel exceptions on the collector side. The agents appear to be working fine.



Error stacktrace:

org.apache.flume.ChannelException: Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]
        at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
        at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        …..

And I keep on getting the same error.



P.S.: This same exception is repeated on most of the Flume collector machines, but not at the same time. There is usually a difference of a couple of hours or more.



1. HDFS sinks write to an Amazon EC2 cloud instance.

2. The data dir and checkpoint dir of the file channel in each Flume collector instance are mounted on a separate Hadoop EBS drive. This makes sure that two separate collectors do not overlap their log and checkpoint dirs. There is a symbolic link, i.e. /usr/lib/flume-ng/datasource --> /hadoop/ebs/mnt-1

3. Flume works fine for a couple of days, and all the agents and collectors are initialized properly without exceptions.
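
To illustrate point 2, the per-collector layout might look like this in each collector's properties file (paths and names hypothetical, following the symlinked mount above):

```properties
collector.channels.c2.type = file
# Keep checkpoint and data dirs on this collector's own EBS mount so no two
# collectors ever share them.
collector.channels.c2.checkpointDir = /usr/lib/flume-ng/datasource/channel1/checkpoint
collector.channels.c2.dataDirs = /usr/lib/flume-ng/datasource/channel1/logs
```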



Questions:

Exception “Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]”. According to the documentation, such an exception occurs only if two processes are accessing the same file/directory. However, each channel is configured separately, so no two channels should access the same dir. Hence, this exception does not indicate anything. Please correct me if I'm wrong.

Also, hdfs.callTimeout indicates the timeout for HDFS open/write operations. If there is no response within that duration, it times out; and if it times out, it closes the file. Please correct me if I'm wrong. Also, is there a way to specify the number of retries before it closes the file?
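
For what it's worth, the sink-side timeout I mean is configured like this (sink name hypothetical; hdfs.callTimeout is in milliseconds):

```properties
collector.sinks.k1.type = hdfs
# Milliseconds to wait for HDFS open/write/flush/close calls before timing out.
collector.sinks.k1.hdfs.callTimeout = 10000
```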



Your inputs/suggestions will be greatly appreciated.





Regards

Kushal Mangtani

Software Engineer








Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Posted by Hari Shreedharan <hs...@cloudera.com>.
Can you try the 1.5 release? There were a few fixes that went in.

Mangtani, Kushal wrote:
>
> Apache flume 1.4 Tarball

RE: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Posted by "Mangtani, Kushal" <Ku...@viasat.com>.
Apache flume 1.4 Tarball
________________________________
From: Hari Shreedharan [hshreedharan@apache.org]
Sent: Friday, August 15, 2014 9:27 AM
To: user@flume.apache.org
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

What version of Flume are you using?



Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Posted by Hari Shreedharan <hs...@apache.org>.
What version of Flume are you using?


On Tue, Aug 12, 2014 at 1:51 PM, Mangtani, Kushal <Kushal.Mangtani@viasat.com> wrote:

>  Bumping this up; to make sure someone answers this.
>
> P.S: let me know if i need to post these questions on a seperate thread.
>
>  Thanks,
> Kushal Mangtani
>
>  ------------------------------
> *From:* Mangtani, Kushal
> *Sent:* Friday, August 08, 2014 12:39 PM
> *To:* user@flume.apache.org
> *Subject:* RE: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>   Hello FlumeTeam,
>
>  I have recently seen a bug/weird behaviour in File Channel. I am using
> FileChannel in my prod env; so save me from hickups in my prod. Recently, I
> got my file Channel Full.
> So; the only ways of fixing this was:
>
>    1. restart the flume process.
>    2. twaek the transactionCapacity of fileChannel.
>
> i went with 1) .However, after doing so; my flume ps was stuck and the
> logs were:
>
>  08 Aug 2014 19:03:54,014 INFO  [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:597)  - File
> position exceeds the threshold: 1623195647, position: 1623195649
>
> 08 Aug 2014 19:03:54,015 INFO  [lifecycleSupervisor-1-4]
> (org.apache.flume.channel.file.LogFile$SequentialReader.next:608)  -
> Encountered EOF at 1623195649 in
> /usr/lib/flume-ng/datastore/channel1/logs/log-5802
>
>
>  Looks like for some reason FilePointer was at a position > than the
> FileSize. Ultimately; I had to delete the logs,checkpoint,backup-checkpoint
> for my flume process to process events.
>
> Sp; the whole purpose of FileChannel i.e better durability vs average
> performance was defeated here.
>
>
>  Questions:
>
>
>    1. Is there something I can have done to preserve this data Loss ?
>    2. Also; I believ Flume-ng is push -pull mechanism; where source
>    pushes events to channels and sinks pulls events from channels which is
>    contradictory to flume-og (push only mechanism). Correct me if im wrong?
>    Was there a reason for this push-pull architecture in flume-land ?
>
> Thanks,
> Kushal Mangtani
>
>  ------------------------------
> *From:* Hari Shreedharan [hshreedharan@cloudera.com]
> *Sent:* Friday, February 28, 2014 11:38 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>   It is currently in trunk, so it will be in flume 1.5
>
>
> Thanks,
> Hari
>
>  On Friday, February 28, 2014 at 11:30 AM, Mangtani, Kushal wrote:
>
>   Hari,
>
>
>
> Thanks for the feedback.This was really helpful. I am going to use
> provisioned IO for a while to make sure the exception does not comes back.
>
>
>
> Also, from the comments section of the Jira ticket given below, I noticed
> that you were able to identify the reason of the exception perhaps old logs
> are never deleted. Are you guys going to put a patch to in flume 1.5 so
> that this exception is resolved?
>
>
>
> -Kushal mangtani
>
>
>
> *From:* Hari Shreedharan [mailto:hshreedharan@cloudera.com
> <hs...@cloudera.com>]
> *Sent:* Thursday, February 27, 2014 11:19 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>
>
> See https://issues.apache.org/jira/browse/FLUME-2307
> <https://urldefense.proofpoint.com/v1/url?u=https://issues.apache.org/jira/browse/FLUME-2307&k=OWT%2FB14AE7ysJN06F7d2nQ%3D%3D%0A&r=Ige9%2FQENXuGqSGiXpuvHakVLuIySu7e10oNaj%2FGB%2B0I%3D%0A&m=PM9%2FMPLJ2TJ%2Fh%2BBMW%2BqpQ1UrxcZbZNPwx5%2FdhkJpEaw%3D%0A&s=91453e467ee8ed73fb29bace503614ae8091d624bdba0f77dedaf43b18e46c41>
>
>
>
> This jira removed the write-timeout, but that only makes sure that there
> is no transaction in limbo. The real reason like I said is slow IO. Try
> using provisioned IO for better throughput.
>
>
>
>
>
> Thanks,
>
> Hari
>
>
>
> On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:
>
>   Hari,
>
>
>
> Thanks for the prompt reply. The current file channel’s  write-timeout =
> 30 sec .EBS drive current  capacity = 200 GB . The rate of writes is 60
> events/min; where each event is approx. 40 KB.
>
>
>
> I am thinking of increase file channel write-timeout to 60 sec. What do
> you suggest?
>
> Also,one strange thing I noticed all the flume-collectors  also get the
> same exception.However, all have a separate ebs drive. Any inputs?
>
>
>
> Thanks,
>
> Kushal Mangtani
>
>
>
> *From:* Hari Shreedharan [mailto:hshreedharan@cloudera.com
> <hs...@cloudera.com>]
> *Sent:* Thursday, February 27, 2014 10:35 AM
> *To:* user@flume.apache.org
> *Subject:* Re: File Channel Exception "Failed to obtain lock for writing
> to the log.Try increasing the log write timeout value"
>
>
>
> For now, increase the file channel’s write-timeout parameter to around 30
> or so (basically file channel is timing out while writing to disk). But the
> basic problem you are seeing is that your EBS instance is very slow and IO
> is taking too long. You either need to increase your EBS IO capacity, or
> reduce the rate or writes.
>
>
>
>
>
> Thanks,
>
> Hari
>
>
>
> On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal wrote:
>
>
>
>
>
> *From:* Mangtani, Kushal
> *Sent:* Wednesday, February 26, 2014 4:51 PM
> *To:* 'user@flume.apache.org'; 'user-subscribe@flume.apache.org'
> *Cc:* Rangnekar, Rohit; 'dev@flume.apache.org'
> *Subject:* File Channel Exception "Failed to obtain lock for writing to
> the log.Try increasing the log write timeout value"
>
>
>
> Hi,
>
>
>
> I'm using Flume-Ng 1.4 cdh4.4 Tarball for collecting aggregated logs.
>
> I am running a 2 tier(agent,collector) Flume Configuration with custom
> plugins. There are approximately 20 agents (receiving data) and 6 collector
> flume (writing to HDFS) machines all running independenly. However, I have
> been facing some File Channel Exceptions on the collector side. The agent
> appears to be working fine.
>
>
>
>  Error  stacktrace:
>
>                              org.apache.flume.ChannelException: Failed to
> obtain lock for writing to the log. Try increasing the log write timeout
> value. [channel=c2]
>
>                              at
> org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
>
>                              at
> org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
>
>                              at
> org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
>
>                              at
> org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>
>                              at
> org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>
>                              …..
>
>                              And I keep on getting the same error
>
>
>
> P.S.: This same exception is repeated on most
> of the flume collector machines, but not at the same time. There is
> usually a difference of a couple of hours or more.
>
>
>
> 1. The HDFS sink output is written in the Amazon EC2 cloud instance.
>
> 2. The data dir and checkpoint dir of the file channel in all flume collector
> instances are mounted on a separate Hadoop EBS drive. This makes sure that
> two separate collectors do not overlap their log and checkpoint dirs. There
> is a symbolic link, i.e. /usr/lib/flume-ng/datasource --> /hadoop/ebs/mnt-1
>
> 3. Flume works fine for a couple of days, and all the agents and collectors
> are initialized properly without exceptions.
>
>
>
> Questions:
>
> Exception “Failed to obtain lock for writing to the log. Try increasing
> the log write timeout value. [channel=c2]”. According to the
> documentation, such an exception occurs only if two processes are accessing
> the same file/directory. However, each channel is configured separately, so
> no two channels should access the same dir. Hence, this exception does not
> indicate anything. Please correct me if I'm wrong.
>
> Also, hdfs.callTimeout indicates calling HDFS for open/write operations.
> If there is no response within that duration, it times out, and if it times
> out, it closes the file. Please correct me if I'm wrong. Also, is there a
> way to specify the number of retries before it closes the file?
>
>
>
> Your inputs/suggestions will be thoroughly appreciated.
>
>
>
>
>
> Regards
>
> Kushal Mangtani
>
> Software Engineer
>
>
>
>
>
>
>
>
>

RE: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Posted by "Mangtani, Kushal" <Ku...@viasat.com>.
Bumping this up to make sure someone answers it.

P.S.: Let me know if I need to post these questions in a separate thread.

Thanks,
Kushal Mangtani

________________________________
From: Mangtani, Kushal
Sent: Friday, August 08, 2014 12:39 PM
To: user@flume.apache.org
Subject: RE: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

Hello FlumeTeam,

I have recently seen a bug/weird behaviour in the file channel. I am using FileChannel in my prod env, so please save me from hiccups in prod. Recently, my file channel became full.
The only ways of fixing this were:

  1.  restart the Flume process.
  2.  tweak the transactionCapacity of the file channel.
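For reference, a minimal file channel stanza with the knobs in play might look like this (the agent name `collector` and the paths/values are hypothetical; note that a channel-full condition is governed by `capacity`, while `transactionCapacity` bounds how many events a single put/take batch may hold):

```properties
# Hypothetical agent "collector"; property names per the Flume 1.4 FileChannel docs.
collector.channels.c1.type = file
collector.channels.c1.checkpointDir = /usr/lib/flume-ng/datastore/channel1/checkpoint
collector.channels.c1.dataDirs = /usr/lib/flume-ng/datastore/channel1/logs
# Total events the channel may hold; this is what "channel full" is measured against.
collector.channels.c1.capacity = 1000000
# Max events per transaction (per sink/source batch), not the channel's total size.
collector.channels.c1.transactionCapacity = 10000
```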

I went with 1). However, after doing so, my Flume process was stuck and the logs showed:


08 Aug 2014 19:03:54,014 INFO  [lifecycleSupervisor-1-4] (org.apache.flume.channel.file.LogFile$SequentialReader.next:597)  - File position exceeds the threshold: 1623195647, position: 1623195649

08 Aug 2014 19:03:54,015 INFO  [lifecycleSupervisor-1-4] (org.apache.flume.channel.file.LogFile$SequentialReader.next:608)  - Encountered EOF at 1623195649 in /usr/lib/flume-ng/datastore/channel1/logs/log-5802


It looks like, for some reason, the file pointer was at a position greater than the file size. Ultimately, I had to delete the logs, checkpoint, and backup-checkpoint directories for my Flume process to process events again.

So, the whole purpose of the file channel, i.e. better durability at the cost of average performance, was defeated here.


Questions:

  1.  Is there something I could have done to prevent this data loss?
  2.  Also, I believe Flume NG uses a push-pull mechanism, where sources push events to channels and sinks pull events from channels, in contrast to Flume OG (a push-only mechanism). Correct me if I'm wrong. Was there a reason for this push-pull architecture in Flume land?
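On question 2: the decoupling can be modeled as a source pushing into a bounded buffer that a sink drains at its own pace. This is a toy illustration in plain Python, not Flume code; all the names are made up:

```python
# Toy model of Flume NG's push-pull design: the channel is a bounded buffer,
# the source pushes into it, and the sink pulls from it independently.
from queue import Queue, Full

channel = Queue(maxsize=3)  # stands in for a channel with capacity = 3

def source_push(event):
    # Source side: push; Full plays the role of a ChannelException when full.
    channel.put_nowait(event)

def sink_pull():
    # Sink side: pull one event; the sink controls its own drain rate.
    return channel.get_nowait()

for i in range(3):
    source_push(f"event-{i}")

overflow = False
try:
    source_push("event-3")  # capacity reached -> backpressure on the source
except Full:
    overflow = True

drained = [sink_pull() for _ in range(3)]
print(overflow, drained)  # → True ['event-0', 'event-1', 'event-2']
```

The point of the buffer in the middle is that a slow sink exerts backpressure (channel fills up) instead of events being pushed end-to-end, which is the durability/decoupling argument for NG's design.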

Thanks,
Kushal Mangtani

________________________________
From: Hari Shreedharan [hshreedharan@cloudera.com]
Sent: Friday, February 28, 2014 11:38 AM
To: user@flume.apache.org
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"

It is currently in trunk, so it will be in flume 1.5


Thanks,
Hari


On Friday, February 28, 2014 at 11:30 AM, Mangtani, Kushal wrote:

Hari,



Thanks for the feedback. This was really helpful. I am going to use provisioned IO for a while to make sure the exception does not come back.



Also, from the comments section of the JIRA ticket given below, I noticed that you were able to identify the reason for the exception: perhaps old logs are never deleted. Are you going to put a patch into Flume 1.5 so that this exception is resolved?



-Kushal mangtani



From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Thursday, February 27, 2014 11:19 AM
To: user@flume.apache.org<ma...@flume.apache.org>
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



See https://issues.apache.org/jira/browse/FLUME-2307



This JIRA removed the write-timeout, but that only makes sure that there is no transaction in limbo. The real reason, like I said, is slow IO. Try using provisioned IO for better throughput.





Thanks,

Hari



On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:

Hari,



Thanks for the prompt reply. The current file channel write-timeout = 30 sec. EBS drive current capacity = 200 GB. The rate of writes is 60 events/min, where each event is approx. 40 KB.
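For context, the numbers above work out to a fairly modest sustained write rate (simple arithmetic on the stated load):

```python
# 60 events/min at ~40 KB each -> sustained KB/s hitting the file channel.
events_per_min = 60
event_kb = 40

kb_per_sec = events_per_min * event_kb / 60
print(kb_per_sec)  # → 40.0 KB/s
```

At ~40 KB/s, raw throughput is unlikely to saturate even a small EBS volume, which points at IO latency/contention rather than volume bandwidth.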



I am thinking of increasing the file channel write-timeout to 60 sec. What do you suggest?

Also, one strange thing I noticed: all the flume collectors also get the same exception. However, each has a separate EBS drive. Any inputs?



Thanks,

Kushal Mangtani



From: Hari Shreedharan [mailto:hshreedharan@cloudera.com]
Sent: Thursday, February 27, 2014 10:35 AM
To: user@flume.apache.org<ma...@flume.apache.org>
Subject: Re: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



For now, increase the file channel’s write-timeout parameter to around 30 or so (basically the file channel is timing out while writing to disk). But the basic problem you are seeing is that your EBS instance is very slow and IO is taking too long. You either need to increase your EBS IO capacity, or reduce the rate of writes.
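As a sketch, that parameter sits on the channel definition (hypothetical agent name `collector`; channel name c2 taken from the stack trace; value in seconds):

```properties
collector.channels.c2.type = file
# Seconds to wait for the log write lock before giving up (pre-1.5 FileChannel;
# removed by FLUME-2307).
collector.channels.c2.write-timeout = 30
```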





Thanks,

Hari



On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal wrote:





From: Mangtani, Kushal
Sent: Wednesday, February 26, 2014 4:51 PM
To: 'user@flume.apache.org<ma...@flume.apache.org>'; 'user-subscribe@flume.apache.org<ma...@flume.apache.org>'
Cc: Rangnekar, Rohit; 'dev@flume.apache.org<ma...@flume.apache.org>'
Subject: File Channel Exception "Failed to obtain lock for writing to the log.Try increasing the log write timeout value"



Hi,



I'm using Flume-Ng 1.4 cdh4.4 Tarball for collecting aggregated logs.

I am running a 2-tier (agent, collector) Flume configuration with custom plugins. There are approximately 20 agent (receiving data) and 6 collector (writing to HDFS) Flume machines, all running independently. However, I have been facing some file channel exceptions on the collector side. The agents appear to be working fine.



 Error  stacktrace:

                             org.apache.flume.ChannelException: Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]

                             at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)

                             at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)

                             at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)

                             at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)

                             at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)

                             …..

                             And I keep on getting the same error



                             P.S.: This same exception is repeated on most of the flume collector machines, but not at the same time. There is usually a difference of a couple of hours or more.



1. The HDFS sink output is written in the Amazon EC2 cloud instance.

2. The data dir and checkpoint dir of the file channel in all flume collector instances are mounted on a separate Hadoop EBS drive. This makes sure that two separate collectors do not overlap their log and checkpoint dirs. There is a symbolic link, i.e. /usr/lib/flume-ng/datasource --> /hadoop/ebs/mnt-1
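The mount/symlink layout described in point 2 can be sketched like this (paths live under a scratch directory rather than the real /usr/lib/flume-ng and /hadoop/ebs mounts; a sketch of the setup, not the exact commands used):

```shell
# Recreate the described layout under a scratch dir (real paths would need root).
set -e
base=$(mktemp -d)

# Per-collector EBS mount point (one per collector, so dirs never overlap).
mkdir -p "$base/hadoop/ebs/mnt-1"

# Symlink standing in for /usr/lib/flume-ng/datasource -> /hadoop/ebs/mnt-1
ln -s "$base/hadoop/ebs/mnt-1" "$base/datasource"

readlink "$base/datasource"
```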

3. Flume works fine for a couple of days, and all the agents and collectors are initialized properly without exceptions.



Questions:

Exception “Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]”. According to the documentation, such an exception occurs only if two processes are accessing the same file/directory. However, each channel is configured separately, so no two channels should access the same dir. Hence, this exception does not indicate anything. Please correct me if I'm wrong.

Also, hdfs.callTimeout indicates calling HDFS for open/write operations. If there is no response within that duration, it times out, and if it times out, it closes the file. Please correct me if I'm wrong. Also, is there a way to specify the number of retries before it closes the file?
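For reference, the timeout in question is set on the HDFS sink definition (hypothetical agent/sink names; the value is in milliseconds in Flume 1.4):

```properties
collector.sinks.k1.type = hdfs
# Milliseconds to allow for HDFS operations (open, write, flush, close)
# before the call is considered timed out.
collector.sinks.k1.hdfs.callTimeout = 60000
```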



Your inputs/suggestions will be thoroughly appreciated.





Regards

Kushal Mangtani

Software Engineer