You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flume.apache.org by aaron morton <aa...@thelastpickle.com> on 2011/08/14 12:18:07 UTC
race conditions and WAL shutdown
I'm starting to play with Flume and running into some issues. I've found a couple of tickets that talk about similar things, but wanted to check if it was a problem on my side…
Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
start a master and bin/flume node_nowatch -n agen and bin/flume node_nowatch -n coll
Then add the following config via the web portal for the master (am using batch / unbatch as a test for aggregating data later)…
agen : text("/my/file") | agentSink("localhost",35853) ;
coll : collectorSource(35853) | batch(100) unbatch console ;
On the agen node it goes bang and shuts it's self down…
21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed, backoff (1000ms): Failed to open thrift event sink to localhost:35853 : java.net.ConnectException: Connection refused
21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in state OPENING
21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to localhost:35853 opened
21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on try 1
21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:40:54,218 INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small closed
21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
21:40:54,219 INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220 INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting down notice, odd
21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
I waited for a while, no retry. So killed both processes and started the coll then the agen process. The lines from the source file were delivered and this is what I see in the agen process…
21:52:43,397 INFO NaiveFileWALManager:760 - opening log file 20110814-214053048+1200.1313314853048500000.00000023
21:52:43,605 INFO TextFileSource:70 - File /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small closed
21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
21:52:43,608 INFO SeqfileEventSink:92 - closed /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended 20110814-215243380+1200.1313315563380444000.00000023
21:52:43,609 INFO NaiveFileWALManager:474 - File lives in /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:43,952 INFO WALAckManager:92 - Ack for 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen )
21:52:43,952 INFO NaiveFileWALManager:760 - opening log file 20110814-215243380+1200.1313315563380444000.00000023
21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in state ACTIVE
21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still making progress
21:52:44,316 INFO WALAckManager:92 - Ack for 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager (dir=/tmp/flume-aaron/agent/agen )
21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but getting another shutting down notice, odd
21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting down
21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
21:53:48,409 INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 64093ms
21:53:48,409 INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 64457ms
21:54:48,444 INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 60035ms
21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states
21:54:48,444 INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 60035ms
21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states
21:55:48,478 INFO WALAckManager:153 - Retransmitting 20110814-215243380+1200.1313315563380444000.00000023 after being stale for 60034ms
21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states
I noticed nothing in the logs for coll.
My questions are:
1) Am I going about this test the right way ?
2) For the first case, should the agen process retry ? It sounds a little like this https://issues.apache.org/jira/browse/FLUME-711
3) Are the "NaiveFileWALManager shutting down" messages as bad as the sound ? I'm guessing not because they are INFO.
4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected ACTIVE but timed out in state OPENING" ?
5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I checked lsof with the processes still running and saw
202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
6) "WARN NaiveFileWALManager:789 - There was a race that happend with SENT vs SENDING states" Is this a known issue ? Am I seeing this ? https://issues.apache.org/jira/browse/FLUME-637
7) "INFO WALAckManager:153 - Retransmitting 20110814-214053048+1200.1313314853048500000.00000023 after being stale for 60035ms" Went on as along as I left the agen process running, there was nothing been displayed on the coll though. Is this also part of FLUME-637 ?
Thanks for any help.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
Re: race conditions and WAL shutdown
Posted by aaron morton <aa...@thelastpickle.com>.
Thanks I'll take a look and see how I get on.
Cheers
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 16/08/2011, at 6:17 AM, Mingjie Lai wrote:
>
> Saw the problem before. The race condition warning really confused me at the very beginning, but it turned out to be a configure issue. Please take a look at these threads:
>
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827
>
> https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48
>
> In short, you need to use collector wrapper for agentE2ESink.
>
> Thanks,
> Mingjie
>
>
> On 08/14/2011 03:18 AM, aaron morton wrote:
>> I'm starting to play with Flume and running into some issues. I've found
>> a couple of tickets that talk about similar things, but wanted to check
>> if it was a problem on my side…
>>
>> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>>
>> start a master and bin/flume node_nowatch -n agen and bin/flume
>> node_nowatch -n coll
>>
>> Then add the following config via the web portal for the master (am
>> using batch / unbatch as a test for aggregating data later)…
>>
>> agen : text("/my/file") | agentSink("localhost",35853) ;
>> coll : collectorSource(35853) | batch(100) unbatch console ;
>>
>> On the agen node it goes bang and shuts it's self down…
>>
>> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
>> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
>> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
>> java.net.ConnectException: Connection refused
>> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
>> state OPENING
>> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
>> localhost:35853 opened
>> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
>> try 1
>> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:40:54,218 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:40:54,219 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
>> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>>
>> I waited for a while, no retry. So killed both processes and started the
>> coll then the agen process. The lines from the source file were
>> delivered and this is what I see in the agen process…
>>
>> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-214053048+1200.1313314853048500000.00000023
>> 21:52:43,605 INFO TextFileSource:70 - File
>> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
>> closed
>> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
>> 21:52:43,608 INFO SeqfileEventSink:92 - closed
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
>> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:43,952 INFO WALAckManager:92 - Ack for
>> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
>> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
>> 20110814-215243380+1200.1313315563380444000.00000023
>> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
>> state ACTIVE
>> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
>> making progress
>> 21:52:44,316 INFO WALAckManager:92 - Ack for
>> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
>> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
>> (dir=/tmp/flume-aaron/agent/agen )
>> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
>> getting another shutting down notice, odd
>> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
>> down
>> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 64093ms
>> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 64457ms
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms
>> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
>> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
>> for 60034ms
>> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
>> happend with SENT vs SENDING states
>>
>> I noticed nothing in the logs for coll.
>>
>>
>> My questions are:
>>
>> 1) Am I going about this test the right way ?
>>
>> 2) For the first case, should the agen process retry ? It sounds a
>> little like this https://issues.apache.org/jira/browse/FLUME-711
>>
>> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
>> sound ? I'm guessing not because they are INFO.
>>
>> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
>> ACTIVE but timed out in state OPENING" ?
>>
>> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
>> checked lsof with the processes still running and saw
>> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
>> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>>
>> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
>> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
>> https://issues.apache.org/jira/browse/FLUME-637
>>
>> 7) "INFO WALAckManager:153 - Retransmitting
>> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
>> for 60035ms" Went on as along as I left the agen process running, there
>> was nothing been displayed on the coll though. Is this also part of
>> FLUME-637 ?
>>
>> Thanks for any help.
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
Re: race conditions and WAL shutdown
Posted by Mingjie Lai <mj...@gmail.com>.
Saw the problem before. The race condition warning really confused me at
the very beginning, but it turned out to be a configure issue. Please
take a look at these threads:
https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/c35f07463c5c2721/0f63f323f066a827?lnk=gst&q=agente2esink+#0f63f323f066a827
https://groups.google.com/a/cloudera.org/group/flume-user/browse_thread/thread/64e786384928a505/f023c918081e8e48?lnk=gst&q=agente2esink+#f023c918081e8e48
In short, you need to use collector wrapper for agentE2ESink.
Thanks,
Mingjie
On 08/14/2011 03:18 AM, aaron morton wrote:
> I'm starting to play with Flume and running into some issues. I've found
> a couple of tickets that talk about similar things, but wanted to check
> if it was a problem on my side…
>
> Using 0.9.4 bin distro on OSX lion, all local, Java 1.6.0_26
>
> start a master and bin/flume node_nowatch -n agen and bin/flume
> node_nowatch -n coll
>
> Then add the following config via the web portal for the master (am
> using batch / unbatch as a test for aggregating data later)…
>
> agen : text("/my/file") | agentSink("localhost",35853) ;
> coll : collectorSource(35853) | batch(100) unbatch console ;
>
> On the agen node it goes bang and shuts it's self down…
>
> 21:40:53,064 INFO NaiveFileWALManager:179 - NaiveFileWALManager is now open
> 21:40:53,066 INFO InsistentOpenDecorator:128 - open attempt 0 failed,
> backoff (1000ms): Failed to open thrift event sink to localhost:35853 :
> java.net.ConnectException: Connection refused
> 21:40:54,063 ERROR DirectDriver:311 - Expected ACTIVE but timed out in
> state OPENING
> 21:40:54,069 INFO ThriftEventSink:112 - ThriftEventSink to
> localhost:35853 opened
> 21:40:54,069 INFO InsistentOpenDecorator:116 - Opened ThriftEventSink on
> try 1
> 21:40:54,069 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:40:54,069 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:40:54,218 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:40:54,218 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:40:54,219 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,219 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-214053048+1200.1313314853048500000.00000023
> 21:40:54,220 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:40:54,221 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
>
> I waited for a while, no retry. So killed both processes and started the
> coll then the agen process. The lines from the source file were
> delivered and this is what I see in the agen process…
>
> 21:52:43,397 INFO NaiveFileWALManager:760 - opening log file
> 20110814-214053048+1200.1313314853048500000.00000023
> 21:52:43,605 INFO TextFileSource:70 - File
> /Users/aaron/data/db-pedia/stats/pagecounts-20110625-010000-distributed-small
> closed
> 21:52:43,605 INFO RollSink:206 - closing RollSink 'ackingWal'
> 21:52:43,608 INFO SeqfileEventSink:92 - closed
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO AckListener$Empty:45 - Empty Ack Listener ended
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,609 INFO NaiveFileWALManager:474 - File lives in
> /tmp/flume-aaron/agent/agen/writing/20110814-215243380+1200.1313315563380444000.00000023
> 21:52:43,610 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:43,952 INFO WALAckManager:92 - Ack for
> 20110814-214053048+1200.1313314853048500000.00000023 is queued to be checked
> 21:52:43,952 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:43,952 INFO NaiveFileWALManager:760 - opening log file
> 20110814-215243380+1200.1313315563380444000.00000023
> 21:52:44,110 ERROR DirectDriver:311 - Expected IDLE but timed out in
> state ACTIVE
> 21:52:44,111 INFO NaiveFileWALDeco:185 - Closing wal log, subsink still
> making progress
> 21:52:44,316 INFO WALAckManager:92 - Ack for
> 20110814-215243380+1200.1313315563380444000.00000023 is queued to be checked
> 21:52:44,316 INFO WALSource:97 - end of file NaiveFileWALManager
> (dir=/tmp/flume-aaron/agent/agen )
> 21:52:44,517 WARN NaiveFileWALManager:205 - Already shutting down, but
> getting another shutting down notice, odd
> 21:52:44,518 INFO NaiveFileWALManager:208 - NaiveFileWALManager shutting
> down
> 21:52:44,518 INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 64093ms
> 21:53:48,409 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 64457ms
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:54:48,444 INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms
> 21:54:48,444 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
> 21:55:48,478 INFO WALAckManager:153 - Retransmitting
> 20110814-215243380+1200.1313315563380444000.00000023 after being stale
> for 60034ms
> 21:55:48,479 WARN NaiveFileWALManager:789 - There was a race that
> happend with SENT vs SENDING states
>
> I noticed nothing in the logs for coll.
>
>
> My questions are:
>
> 1) Am I going about this test the right way ?
>
> 2) For the first case, should the agen process retry ? It sounds a
> little like this https://issues.apache.org/jira/browse/FLUME-711
>
> 3) Are the "NaiveFileWALManager shutting down" messages as bad as the
> sound ? I'm guessing not because they are INFO.
>
> 4) Is there a timeout to set to avoid "ERROR DirectDriver:311 - Expected
> ACTIVE but timed out in state OPENING" ?
>
> 5) "INFO ThriftEventSink:89 - ThriftEventSink on port 35853 closed" I
> checked lsof with the processes still running and saw
> 202-126-206-214:tmp aaron$ lsof -i -P | grep 35853
> java 44571 aaron 113u IPv6 0xffffff8017bec600 0t0 TCP *:35853 (LISTEN)
>
> 6) "WARN NaiveFileWALManager:789 - There was a race that happend with
> SENT vs SENDING states" Is this a known issue ? Am I seeing this ?
> https://issues.apache.org/jira/browse/FLUME-637
>
> 7) "INFO WALAckManager:153 - Retransmitting
> 20110814-214053048+1200.1313314853048500000.00000023 after being stale
> for 60035ms" Went on as along as I left the agen process running, there
> was nothing been displayed on the coll though. Is this also part of
> FLUME-637 ?
>
> Thanks for any help.
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>