You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flume.apache.org by aaron morton <aa...@thelastpickle.com> on 2011/08/14 12:18:07 UTC

race conditions and WAL shutdown

I'm starting to play with Flume and running into some issues. I've found a couple of tickets that talk about similar things, but  wanted to check  if it was a problem on my side…

Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26

start a master and bin/flume node_nowatch -n agen and bin/flume node_nowatch -n coll

Then add the following config via the web portal for the master (am using batch / unbatch as a test for aggregating data later)…

agen : text("/my/file") | agentSink("localhost",35853) ; 
coll : collectorSource(35853) | batch(100) unbatch console ; 

On the agen node it goes bang and shuts it's self down…

21:40:53,064  INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
21:40:53,066  INFO InsistentOpenDecorator:128 - open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink to localhost:35853 : java.net.ConnectException: Connection refused
21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in state OPENING
21:40:54,069  INFO ThriftEventSink:112 - ThriftEventSink to localhost:35853 opened
21:40:54,069  INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on try 1
21:40:54,069  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:40:54,069  INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:40:54,218  INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small closed
21:40:54,218  INFO RollSink:206 - closing RollSink 'ackingWal'
21:40:54,219  INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,219  INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220  INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220  WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting down notice, odd
21:40:54,221  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down

I waited for a while, no retry. So killed both processes and started the coll then the agen process. The lines from the source file were delivered and this is what I see in the agen process…

21:52:43,397  INFO NaiveFileWALManager:760 - opening log file 20110814-214053048+1200.1313314853048500000.00000023
21:52:43,605  INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small closed
21:52:43,605  INFO RollSink:206 - closing RollSink 'ackingWal'
21:52:43,608  INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609  INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609  INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,610  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:43,952  INFO WALAckManager:92 - Ack for 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
21:52:43,952  INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen )
21:52:43,952  INFO NaiveFileWALManager:760 - opening log file 20110814-215243380+1200.1313315563380444000.00000023
21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in state ACTIVE
21:52:44,111  INFO NaiveFileWALDeco:185 - Closing wal log, subsink still making progress
21:52:44,316  INFO WALAckManager:92 - Ack for 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
21:52:44,316  INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen )
21:52:44,517  WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting down notice, odd
21:52:44,518  INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:44,518  INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:53:48,409  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 64093ms
21:53:48,409  INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 64457ms
21:54:48,444  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 60035ms
21:54:48,444  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states
21:54:48,444  INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 60035ms
21:54:48,444  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states
21:55:48,478  INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 60034ms
21:55:48,479  WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states

I noticed nothing in the logs for coll. 


My questions are:

1) Am I going about this test the right way ?

2) For the first case, should the agen process retry ? It sounds a little like this https://issues.apache.org/jira/browse/FLUME-711 

3) Are the "NaiveFileWALManager shutting down" messages as bad as the sound ? I'm guessing not because they are INFO.  

4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected ACTIVE but timed out in state OPENING" ? 

5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed"  I checked lsof with the processes still running and saw
202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
java      44571 aaron  113u  IPv6 0xffffff8017bec600      0t0  TCP *:35853 (LISTEN)

6) "WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states" Is this a known issue ? Am I seeing this ? https://issues.apache.org/jira/browse/FLUME-637 

7) "INFO WALAckManager:153 -  Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 60035ms" Went on as along as I left the agen process running, there was nothing been displayed on the coll though. Is this also part of FLUME-637 ? 

Thanks for any help. 
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

Re: race conditions and WAL shutdown

Posted by aaron morton <aa...@thelastpickle.com>.

Thanks I'll take a look and see how I get on. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16/08/2011, at 6:17 AM, Mingjie Lai wrote:

> 
> Saw the problem before. The race condition warning really confused me at the very beginning, but it turned out to be a configure issue. Please take a look at these threads:
> 
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827
> 
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48
> 
> In short, you need to use collector wrapper for agentE2ESink.
> 
> Thanks,
> Mingjie
> 
> 
> On 08/14/2011 03:18 AM, aaron morton wrote:
>> I'm starting to play with Flume and running into some issues. I've found
>> a couple of tickets that talk about similar things, but wanted to check
>> if it was a problem on my side…
>> 
>> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>> 
>> start a master and bin/flume node_nowatch -n agen and bin/flume
>> node_nowatch -n coll
>> 
>> Then add the following config via the web portal for the master (am
>> using batch / unbatch as a test for aggregating data later)…
>> 
>> agen : text("/my/file") | agentSink("localhost",35853) ;
>> coll : collectorSource(35853) | batch(100) unbatch console ;
>> 
>> On the agen node it goes bang and shuts it's self down…
>> 
>> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
>> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
>> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
>> java.net.ConnectException: Connection refused
>> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
>> state OPENING
>> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
>> localhost:35853 opened
>> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
>> try 1
>> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:40:54,218 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:40:54,219 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 
>> I waited for a while, no retry. So killed both processes and started the
>> coll then the agen process. The lines from the source file were
>> delivered and this is what I see in the agen process…
>> 
>> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:52:43,605 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:52:43,608 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:43,952 INFO WALAckManager:92 - Ack for
>> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
>> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
>> state ACTIVE
>> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
>> making progress
>> 21:52:44,316 INFO WALAckManager:92 - Ack for
>> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
>> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 64093ms
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 64457ms
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60034ms
>> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 
>> I noticed nothing in the logs for coll.
>> 
>> 
>> My questions are:
>> 
>> 1) Am I going about this test the right way ?
>> 
>> 2) For the first case, should the agen process retry ? It sounds a
>> little like this https://issues.apache.org/jira/browse/FLUME-711
>> 
>> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
>> sound ? I'm guessing not because they are INFO.
>> 
>> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
>> ACTIVE but timed out in state OPENING" ?
>> 
>> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
>> checked lsof with the processes still running and saw
>> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
>> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>> 
>> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
>> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
>> https://issues.apache.org/jira/browse/FLUME-637
>> 
>> 7) "INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms" Went on as along as I left the agen process running, there
>> was nothing been displayed on the coll though. Is this also part of
>> FLUME-637 ?
>> 
>> Thanks for any help.
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>

Re: race conditions and WAL shutdown

Posted by Mingjie Lai <mj...@gmail.com>.

Saw the problem before. The race condition warning really confused me at 
the very beginning, but it turned out to be a configure issue. Please 
take a look at these threads:

https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827

https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48

In short, you need to use collector wrapper for agentE2ESink.

Thanks,
Mingjie


On 08/14/2011 03:18 AM, aaron morton wrote:
> I'm starting to play with Flume and running into some issues. I've found
> a couple of tickets that talk about similar things, but wanted to check
> if it was a problem on my side…
>
> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>
> start a master and bin/flume node_nowatch -n agen and bin/flume
> node_nowatch -n coll
>
> Then add the following config via the web portal for the master (am
> using batch / unbatch as a test for aggregating data later)…
>
> agen : text("/my/file") | agentSink("localhost",35853) ;
> coll : collectorSource(35853) | batch(100) unbatch console ;
>
> On the agen node it goes bang and shuts it's self down…
>
> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
> java.net.ConnectException: Connection refused
> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
> state OPENING
> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
> localhost:35853 opened
> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
> try 1
> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:40:54,218 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:40:54,219 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
>
> I waited for a while, no retry. So killed both processes and started the
> coll then the agen process. The lines from the source file were
> delivered and this is what I see in the agen process…
>
> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:52:43,605 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:52:43,608 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:43,952 INFO WALAckManager:92 - Ack for
> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
> state ACTIVE
> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
> making progress
> 21:52:44,316 INFO WALAckManager:92 - Ack for
> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 64093ms
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 64457ms
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60034ms
> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
>
> I noticed nothing in the logs for coll.
>
>
> My questions are:
>
> 1) Am I going about this test the right way ?
>
> 2) For the first case, should the agen process retry ? It sounds a
> little like this https://issues.apache.org/jira/browse/FLUME-711
>
> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
> sound ? I'm guessing not because they are INFO.
>
> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
> ACTIVE but timed out in state OPENING" ?
>
> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
> checked lsof with the processes still running and saw
> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>
> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
> https://issues.apache.org/jira/browse/FLUME-637
>
> 7) "INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms" Went on as along as I left the agen process running, there
> was nothing been displayed on the coll though. Is this also part of
> FLUME-637 ?
>
> Thanks for any help.
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>