Posted to dev@flume.apache.org by "Will McQueen (Created) (JIRA)" <ji...@apache.org> on 2011/11/01 01:21:32 UTC

[jira] [Created] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Avro client conn failure results in 60-second wait before terminating
---------------------------------------------------------------------

                 Key: FLUME-827
                 URL: https://issues.apache.org/jira/browse/FLUME-827
             Project: Flume
          Issue Type: Bug
          Components: Node
    Affects Versions: NG alpha 1
            Reporter: Will McQueen


Launching the Avro client when the Flume NG node isn't yet running results in a connection-refused error. The client should then shut down immediately, but instead it waits for 1 minute before terminating.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "Will McQueen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141640#comment-13141640 ] 

Will McQueen commented on FLUME-827:
------------------------------------

Okay, so it sounds like the client's retry policy should be configurable and left to the client, as Ralph says. Since each agent can act as both a client and a server (sink and source, respectively), each sink and each source can have its own pluggable retry policy, as Eric suggests. The behavior of FLUME-823 can also be dictated by these policies. Policies can be read from a local config file (e.g., flume.properties), although for a large number of agents I can see value in some kind of central management, with a policy server that agents fetch their configs from.

Eric, for the original long design where clients send events to a group node that contains the retry policy (among other policies), if I understand correctly we'd have something like this:

Client ==> Group ==> Server

What would the client's retry policy be if it's now the group node that disappears temporarily?
                
> Avro client conn failure results in 60-second wait before terminating
> ---------------------------------------------------------------------
>
>                 Key: FLUME-827
>                 URL: https://issues.apache.org/jira/browse/FLUME-827
>             Project: Flume
>          Issue Type: Bug
>          Components: Node
>    Affects Versions: NG alpha 1
>            Reporter: Will McQueen
>             Fix For: NG alpha 2
>
>
> Launching the avro client when the Flume NG node isn't yet running will result in a conn refused error. The client should then shutdown immediately, but instead waits for 1 minute before the client is terminated.


[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161860#comment-13161860 ] 

jiraposter@reviews.apache.org commented on FLUME-827:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2995/#review3604
-----------------------------------------------------------

Ship it!


+1

One unrelated thing I noted during the review is that the sink implementation is not thread-safe. It is not a concern right now, but if we decide to move to multi-threaded pollable sink runners, that will need to be addressed.

- Arvind


On 2011-12-02 20:23:44, Eric Sammer wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/2995/
> -----------------------------------------------------------
>
> (Updated 2011-12-02 20:23:44)
>
> Review request for Flume, Arvind Prabhakar and Prasad Mujumdar.
>
> Summary
> -------
>
> Properly detect RPC failures and reconnect as necessary.
>
> This addresses bug FLUME-827.
>     https://issues.apache.org/jira/browse/FLUME-827
>
> Diffs
> -----
>
>   i/flume-ng-core/src/test/java/org/apache/flume/sink/TestAvroSink.java fead54f
>   i/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java e1f381d
>
> Diff: https://reviews.apache.org/r/2995/diff
>
> Testing
> -------
>
> New unit tests that confirm this works given a simple connect, send, kill server, send, start server, reconnect, deliver use case.
> Ran manual tests with a pair of agents handling data during transient failures. Works!
>
> Thanks,
>
> Eric


                
> Avro client conn failure results in 60-second wait before terminating
> ---------------------------------------------------------------------
>
>                 Key: FLUME-827
>                 URL: https://issues.apache.org/jira/browse/FLUME-827
>             Project: Flume
>          Issue Type: Bug
>          Components: Node
>    Affects Versions: NG alpha 1
>            Reporter: Will McQueen
>            Assignee: E. Sammer
>            Priority: Blocker
>             Fix For: NG alpha 2
>
>
> Launching the avro client when the Flume NG node isn't yet running will result in a conn refused error. The client should then shutdown immediately, but instead waits for 1 minute before the client is terminated.


[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161873#comment-13161873 ] 

Hudson commented on FLUME-827:
------------------------------

Integrated in flume-728 #65 (See [https://builds.apache.org/job/flume-728/65/])
    FLUME-827: Avro client conn failure results in 60-second wait before terminating

esammer : http://svn.apache.org/viewvc/?view=rev&rev=1209693
Files : 
* /incubator/flume/branches/flume-728/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java
* /incubator/flume/branches/flume-728/flume-ng-core/src/test/java/org/apache/flume/sink/TestAvroSink.java

                

[jira] [Updated] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-827:
----------------------------

    Fix Version/s: NG alpha 2
    

[jira] [Closed] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer closed FLUME-827.
---------------------------

    

[jira] [Updated] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-827:
----------------------------

    Priority: Blocker  (was: Major)

Marking this a blocker. If the connection between the client and server breaks, the AvroSink currently doesn't reconnect in any meaningful way. This breaks tiered collection.
                
> Avro client conn failure results in 60-second wait before terminating
> ---------------------------------------------------------------------
>
>                 Key: FLUME-827
>                 URL: https://issues.apache.org/jira/browse/FLUME-827
>             Project: Flume
>          Issue Type: Bug
>          Components: Node
>    Affects Versions: NG alpha 1
>            Reporter: Will McQueen
>            Priority: Blocker
>             Fix For: NG alpha 2
>
>
> Launching the avro client when the Flume NG node isn't yet running will result in a conn refused error. The client should then shutdown immediately, but instead waits for 1 minute before the client is terminated.


[jira] [Assigned] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer reassigned FLUME-827:
-------------------------------

    Assignee: E. Sammer
    

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "Ralph Goers (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141181#comment-13141181 ] 

Ralph Goers commented on FLUME-827:
-----------------------------------

Retries should be left to the client. In my case, the Log4j2 client can be configured with multiple agents. If the first doesn't work it should fail over as fast as possible to its secondary agent.  I would expect this same behavior from Agents that are forwarding to other agents.  Ideally, the client should periodically check to see if the primary has come back and then switch back when it does.
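A minimal sketch of the failover behaviour described above, assuming a client that holds its agents in priority order and skips a failed agent until a recheck interval has passed. The class name `FailoverSelector`, its methods, and the recheck interval are hypothetical and are not taken from Flume or Log4j2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Flume or Log4j2 API): picks the best agent in priority order. */
final class FailoverSelector {
  private final List<String> agents;        // priority order, primary first
  private final long recheckMillis;         // how long a failed agent is skipped
  private final Map<String, Long> failedAt = new HashMap<>();

  FailoverSelector(List<String> agents, long recheckMillis) {
    if (agents.isEmpty()) {
      throw new IllegalArgumentException("need at least one agent");
    }
    this.agents = new ArrayList<>(agents);
    this.recheckMillis = recheckMillis;
  }

  /** Record a failed send so the agent is skipped until the recheck interval elapses. */
  void markFailed(String agent, long nowMillis) {
    failedAt.put(agent, nowMillis);
  }

  /** Record a successful send; the agent is treated as healthy again. */
  void markHealthy(String agent) {
    failedAt.remove(agent);
  }

  /** Return the highest-priority agent that is healthy or due for a recheck. */
  String pick(long nowMillis) {
    for (String agent : agents) {
      Long failed = failedAt.get(agent);
      if (failed == null || nowMillis - failed >= recheckMillis) {
        return agent;
      }
    }
    // Every agent failed recently: fall back to the primary rather than give up.
    return agents.get(0);
  }
}
```

Because `pick` always walks the list from the primary, the client switches back to the primary as soon as its recheck interval has elapsed, which is one way to get the "switch back when it returns" behaviour. The caller is expected to call `markFailed` again if that recheck attempt also fails.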
                

[jira] [Updated] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-827:
----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to the flume-728 branch.
                

[jira] [Updated] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

E. Sammer updated FLUME-827:
----------------------------

    Status: Patch Available  (was: Open)

Review up.
                

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140952#comment-13140952 ] 

E. Sammer commented on FLUME-827:
---------------------------------

Will:

I was thinking about this, and I think the client should retry indefinitely (with warnings in the logs). The rationale is similar to that of FLUME-823. In this case, though, there's the potentially common possibility that the Avro server to which the client is connecting (e.g. another Flume node in a tiered collection deployment) disappears temporarily (e.g. because it is restarted or there is a network partition). Ideally, the client would then reestablish the connection and continue transferring data. Thoughts?
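The "retry indefinitely, with warnings in the logs" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `ReconnectLoop`, `retryUntilConnected`, and the capped exponential backoff are hypothetical and are not what AvroSink actually does; the connect attempt is modelled as a plain `Callable`.

```java
import java.util.concurrent.Callable;
import java.util.logging.Logger;

/** Hypothetical sketch (not the AvroSink implementation): retry a connect attempt until it succeeds. */
final class ReconnectLoop {
  private static final Logger LOG = Logger.getLogger(ReconnectLoop.class.getName());

  /**
   * Calls connect repeatedly until it returns normally. Each failure is logged as a
   * warning, then the thread sleeps; the delay doubles per failure up to maxDelayMs.
   */
  static <T> T retryUntilConnected(Callable<T> connect, long initialDelayMs, long maxDelayMs)
      throws InterruptedException {
    long delay = initialDelayMs;
    int attempt = 0;
    while (true) {
      attempt++;
      try {
        return connect.call();
      } catch (InterruptedException e) {
        // Let the caller stop the loop by interrupting the thread.
        throw e;
      } catch (Exception e) {
        LOG.warning("Connect attempt " + attempt + " failed (" + e + "); retrying in " + delay + " ms");
        Thread.sleep(delay);
        delay = Math.min(delay * 2, maxDelayMs);
      }
    }
  }
}
```

The only way out of the loop besides success is thread interruption, so an agent shutdown can still cancel a client that is stuck waiting for a server that never returns.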
                

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141472#comment-13141472 ] 

E. Sammer commented on FLUME-827:
---------------------------------

+1 to what Ralph said. There will likely be pluggable policies that dictate how clients (sinks) pick from servers (sources) in tiered collection deployments. Think of not just HA but also load balancing policies. My original long design for this was that clients would never send to servers, but instead send to a group (which could be made up of a single server) with a policy (round robin, sticky selection (what Ralph describes), weighted, etc.).
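To make the idea of pluggable selection policies concrete, here is a small sketch. The `SelectionPolicy` interface and both implementations are hypothetical names for illustration; nothing here is an existing Flume API, and the sticky variant assumes the caller passes only servers it currently considers reachable.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical plug-in point: choose one server (source) from a group for the next send. */
interface SelectionPolicy {
  String select(List<String> servers);
}

/** Cycles through the group so consecutive sends go to consecutive servers. */
final class RoundRobinPolicy implements SelectionPolicy {
  private final AtomicInteger next = new AtomicInteger();

  @Override
  public String select(List<String> servers) {
    // floorMod keeps the index non-negative even after the counter overflows.
    int index = Math.floorMod(next.getAndIncrement(), servers.size());
    return servers.get(index);
  }
}

/** Always prefers the first server in the list (assumed to hold reachable servers only). */
final class StickyPolicy implements SelectionPolicy {
  @Override
  public String select(List<String> servers) {
    return servers.get(0);
  }
}
```

A group made up of a single server works with either policy, since both then return that one server on every call.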
                

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "E. Sammer (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149853#comment-13149853 ] 

E. Sammer commented on FLUME-827:
---------------------------------

Will:

The idea is that the group is expanded. It's a logical construct, not a physical one. In other words, it's config syntactic sugar only. The user says clients (sinks) a, b, and c are in group 1 and servers (sources) x and y are also in group 1. Now, Flume understands it can load balance sinks a, b, and c to sources x and y. This is the same idea as Flume OG's autoChain (except we'd load balance rather than stick to a single server, although I suppose that could be another policy option). At least this was my thinking / idea.
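As a rough illustration of a group being expanded at configuration time, the sketch below turns group membership into a per-sink list of candidate sources. `GroupExpansion` and `expand` are hypothetical names; how Flume would actually represent groups is not specified in this thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a group is purely logical and is expanded before anything runs. */
final class GroupExpansion {
  /** Maps every sink in the group to its own copy of the group's source list. */
  static Map<String, List<String>> expand(List<String> sinks, List<String> sources) {
    Map<String, List<String>> candidates = new LinkedHashMap<>();
    for (String sink : sinks) {
      candidates.put(sink, new ArrayList<>(sources));
    }
    return candidates;
  }
}
```

With sinks a, b, and c and sources x and y in the same group, each sink ends up with the candidate list [x, y], and a load-balancing policy can then choose among those candidates per send.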
                

[jira] [Commented] (FLUME-827) Avro client conn failure results in 60-second wait before terminating

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161837#comment-13161837 ] 

jiraposter@reviews.apache.org commented on FLUME-827:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2995/
-----------------------------------------------------------

Review request for Flume, Arvind Prabhakar and Prasad Mujumdar.


Summary
-------

Properly detect RPC failures and reconnect as necessary.


This addresses bug FLUME-827.
    https://issues.apache.org/jira/browse/FLUME-827


Diffs
-----

  i/flume-ng-core/src/test/java/org/apache/flume/sink/TestAvroSink.java fead54f 
  i/flume-ng-core/src/main/java/org/apache/flume/sink/AvroSink.java e1f381d 

Diff: https://reviews.apache.org/r/2995/diff


Testing
-------

New unit tests that confirm this works given a simple connect, send, kill server, send, start server, reconnect, deliver use case.
Ran manual tests with a pair of agents handling data during transient failures. Works!


Thanks,

Eric


                