Posted to user@flume.apache.org by Thomas Vachon <va...@sessionm.com> on 2012/01/27 16:29:50 UTC

Unexplainable Bug in Flume Collectors

I have 10 logical collectors per collector node: two for each log file I monitor (one with an HDFS sink, one with an S3 sink).  I recently went from 8 to 10.  The 10th sink is failing 100% of the time.  On a node I see:

2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )


On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine.  I see that it opens the s3n sink for the S3 flow, but no data is being ingested.  On the node I see "Broken pipe" errors, which I suspect are the problem, but I am unable to find a way to fix it.  I confirmed connectivity via telnet to the RPC source port.
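
For reference, the connectivity check was along these lines (host and port taken from the logs above):

    telnet ip-10-194-66-32.ec2.internal 35862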

I have exhausted every measure I can think of to fix this.  I have unmapped, configured, and remapped every node to ensure we did not have a weird problem.  The master shows no errors, and 9 out of the 10 flows are working just as they should.
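
The unmap/config/map cycle was done from the flume shell, roughly as follows (the master host, node names, and sink arguments here are illustrative placeholders, not my exact config):

    flume shell -c master-host
    exec unmap collector-host coreSiteS3
    exec config coreSiteS3 'collectorSource(35862)' 'collectorSink("s3n://BUCKET/logs/","coreSite-")'
    exec map collector-host coreSiteS3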

Does anyone have an idea?

--
Thomas Vachon
Principal Operations Architect
session M
vachon@sessionm.com



Re: Unexplainable Bug in Flume Collectors

Posted by Thomas Vachon <va...@sessionm.com>.
The answer: this is not a bug but user error.  For anyone running 10 logical nodes (besides the logical node named after the physical node) on one physical node, make sure you are not using the default port settings.  In my setup the HTTP status page listens 11 ports above the start of the source range (port 35862 here, as seen in the logs), so the last collector source collided with it, which explains the RSTs in the packet capture.
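
If you hit this, one fix is to re-pin the colliding logical node to a source port well clear of the status page.  A minimal sketch from the flume shell (port number and node/sink names are illustrative):

    exec unmap collector-host coreSiteS3
    exec config coreSiteS3 'collectorSource(35880)' 'collectorSink("s3n://BUCKET/logs/","coreSite-")'
    exec map collector-host coreSiteS3

Alternatively, move the status page itself by overriding its port (in 0.9.x this should be flume.node.http.port in flume-site.xml, but verify the property name for your version).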


On Jan 27, 2012, at 10:56 AM, Thomas Vachon wrote:

> I did a packet capture that I was just able to analyze.  I see the TCP handshake and some PSH/ACKs, then the collector sends a RST.  This is not a pattern I see in any other flow on that box.  It is also observed on my second collector.  I have checked file limits/open descriptors; nothing is above its limit.
> 
> 
> On Jan 27, 2012, at 10:29 AM, Thomas Vachon wrote:
> 
>> I have 10 logical collectors per collector node: two for each log file I monitor (one with an HDFS sink, one with an S3 sink).  I recently went from 8 to 10.  The 10th sink is failing 100% of the time.  On a node I see:
>> 
>> 2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
>> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
>> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
>> 2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
>> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
>> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
>> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
>> 
>> 
>> On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine.  I see that it opens the s3n sink for the S3 flow, but no data is being ingested.  On the node I see "Broken pipe" errors, which I suspect are the problem, but I am unable to find a way to fix it.  I confirmed connectivity via telnet to the RPC source port.
>> 
>> I have exhausted every measure I can think of to fix this.  I have unmapped, configured, and remapped every node to ensure we did not have a weird problem.  The master shows no errors, and 9 out of the 10 flows are working just as they should.
>> 
>> Does anyone have an idea?
>> 
>> --
>> Thomas Vachon
>> Principal Operations Architect
>> session M
>> vachon@sessionm.com
>> 
>> 
> 


Re: Unexplainable Bug in Flume Collectors

Posted by Thomas Vachon <va...@sessionm.com>.
I did a packet capture that I was just able to analyze.  I see the TCP handshake and some PSH/ACKs, then the collector sends a RST.  This is not a pattern I see in any other flow on that box.  It is also observed on my second collector.  I have checked file limits/open descriptors; nothing is above its limit.
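
For anyone chasing something similar: the capture was along these lines, and it is also worth checking which process actually owns the port (interface name illustrative; the port is taken from the logs above):

    # capture the failing flow for offline analysis
    sudo tcpdump -i eth0 -w flow10.pcap 'tcp port 35862'
    # see which process is actually listening on that port
    sudo lsof -nP -iTCP:35862 -sTCP:LISTEN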


On Jan 27, 2012, at 10:29 AM, Thomas Vachon wrote:

> I have 10 logical collectors per collector node: two for each log file I monitor (one with an HDFS sink, one with an S3 sink).  I recently went from 8 to 10.  The 10th sink is failing 100% of the time.  On a node I see:
> 
> 2012-01-27 15:19:05,104 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-142931487+0000.6301485683111842.00000106
> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.debug.StubbornAppendSink: append failed on event 'ip-10-212-145-75.ec2.internal [INFO Fri Jan 27 14:29:31 UTC 2012] { AckChecksum : (long)1327674571487  (string) '5?:?' (double)6.559583946287E-312 } { AckTag : 20120127-142931487+0000.6301485683111842.00000106 } { AckType : beg } ' with error: Append failed java.net.SocketException: Broken pipe
> 2012-01-27 15:19:05,105 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
> 2012-01-27 15:19:05,106 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink on port 35862 closed
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to admin.internal.sessionm.com:35862 opened
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.thrift.ThriftEventSink: ThriftEventSink to ip-10-194-66-32.ec2.internal:35862 opened
> 2012-01-27 15:19:05,108 INFO com.cloudera.flume.handlers.debug.InsistentOpenDecorator: Opened BackoffFailover on try 0
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-142931487+0000.6301485683111842.00000106 is queued to be checked
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 2012-01-27 15:19:05,109 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-143151597+0000.6301625792799805.00000106
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-143151597+0000.6301625792799805.00000106 is queued to be checked
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 2012-01-27 15:19:05,110 INFO com.cloudera.flume.agent.durability.NaiveFileWALManager: opening log file 20120127-151120751+0000.6303994947346458.00000351
> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.WALAckManager: Ack for 20120127-151120751+0000.6303994947346458.00000351 is queued to be checked
> 2012-01-27 15:19:05,111 INFO com.cloudera.flume.agent.durability.WALSource: end of file NaiveFileWALManager (dir=/mnt/flume/flume-flume/agent/coreSiteS3-i-654dd706 )
> 
> 
> On the collector, I see flow 9 (the HDFS sink flow for the same log file) working just fine.  I see that it opens the s3n sink for the S3 flow, but no data is being ingested.  On the node I see "Broken pipe" errors, which I suspect are the problem, but I am unable to find a way to fix it.  I confirmed connectivity via telnet to the RPC source port.
> 
> I have exhausted every measure I can think of to fix this.  I have unmapped, configured, and remapped every node to ensure we did not have a weird problem.  The master shows no errors, and 9 out of the 10 flows are working just as they should.
> 
> Does anyone have an idea?
> 
> --
> Thomas Vachon
> Principal Operations Architect
> session M
> vachon@sessionm.com
> 
>