Posted to dev@avro.apache.org by "James Baldassari (Created) (JIRA)" <ji...@apache.org> on 2012/01/30 01:03:12 UTC

[jira] [Created] (AVRO-1013) NettyTransceiver can hang after server restart

NettyTransceiver can hang after server restart
----------------------------------------------

                 Key: AVRO-1013
                 URL: https://issues.apache.org/jira/browse/AVRO-1013
             Project: Avro
          Issue Type: Bug
    Affects Versions: 1.6.1
            Reporter: James Baldassari
            Priority: Blocker


I ran into a very specific scenario today which can lead to NettyTransceiver hanging indefinitely:

# Start up a NettyServer
# Initialize a NettyTransceiver and SpecificRequestor
# Execute an RPC to establish the connection/handshake with the server
# Shut down the server
# Immediately execute another RPC

After Step 4, NettyTransceiver will detect that the connection has been closed and call NettyTransceiver#disconnect(boolean, boolean, Throwable), which sets 'remote' to null, indicating to Requestor that the NettyTransceiver is now disconnected.  However, if an RPC is executed just after the server has closed its socket (Step 5) and before disconnect() has been called, NettyTransceiver may still try to send this RPC because 'remote' has not yet been set to null.  This race condition is normally ok because NettyTransceiver#getChannel() will detect that the socket has been closed and then try to reestablish the connection.  Unfortunately, in this scenario getChannel() blocks forever when it attempts to acquire the write lock because the read lock has been acquired twice rather than once as getChannel() expects.  The read lock is acquired once by transceive(List<ByteBuffer>, Callback<List<ByteBuffer>>) and again by writeDataPack(NettyDataPack).
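The read-to-write upgrade attempt at the heart of the hang can be reproduced with a bare java.util.concurrent.locks.ReentrantReadWriteLock, independent of the Avro classes (a minimal sketch; the class name is made up):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockUpgradeDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Simulate the double acquisition: transceive() takes the read lock,
        // then writeDataPack() takes it again on the same thread.
        lock.readLock().lock();
        lock.readLock().lock();

        // getChannel() releases the read lock once -- the hold count it expects...
        lock.readLock().unlock();

        // ...but one read hold remains, and ReentrantReadWriteLock does not
        // allow upgrading a read lock to a write lock, so the write lock can
        // never be acquired by this thread.  tryLock with a timeout is used
        // here so the demo terminates; the real code calls lock() and hangs.
        boolean acquired = lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
        System.out.println("write lock acquired: " + acquired);

        lock.readLock().unlock();  // clean up the remaining hold
    }
}
```

The surplus read hold is exactly why getChannel()'s write-lock acquisition never returns in this scenario.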

The fix is fairly simple.  The writeDataPack(NettyDataPack) method (which is private) no longer acquires the read lock itself but instead specifies in its contract that the read lock must be acquired before calling it.  This change prevents the read lock from being acquired more than once by any single thread.  Another change is to have NettyTransceiver#isConnected() perform two checks instead of one: remote != null && isChannelReady(channel).  This second change should allow NettyTransceiver to detect disconnect events more quickly.
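The lock discipline described above can be sketched like this (hypothetical class mirroring the transceive()/writeDataPack() pair, not the actual Avro code): only the public entry point touches the lock, and the private helper merely documents the precondition.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockDisciplineSketch {
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();

    /** Public entry point: the only place the read lock is acquired. */
    public void transceive(String dataPack) {
        stateLock.readLock().lock();
        try {
            writeDataPack(dataPack);
        } finally {
            stateLock.readLock().unlock();
        }
    }

    /**
     * Private helper.  Contract: the caller must already hold the read lock.
     * It deliberately does not acquire the lock itself, so the per-thread
     * read hold count never exceeds one.
     */
    private void writeDataPack(String dataPack) {
        assert stateLock.getReadHoldCount() == 1 : "caller must hold read lock";
        System.out.println("sent: " + dataPack);
    }

    public static void main(String[] args) {
        new LockDisciplineSketch().transceive("hello");
    }
}
```

With at most one read hold per thread, a later release-then-write-lock sequence can no longer self-deadlock.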

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202679#comment-13202679 ] 

Doug Cutting commented on AVRO-1013:
------------------------------------

The new test passes for me even without the changes to NettyTransceiver.java.
                

[jira] [Updated] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-1013:
-----------------------------------

    Attachment: AVRO-1013.patch

Here's a patch with a unit test.
                

[jira] [Updated] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-1013:
-----------------------------------

    Component/s: java
       Assignee: James Baldassari
    

[jira] [Updated] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-1013:
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.6.2
           Status: Resolved  (was: Patch Available)

Committed
                

[jira] [Updated] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Baldassari updated AVRO-1013:
-----------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202693#comment-13202693 ] 

James Baldassari commented on AVRO-1013:
----------------------------------------

Yes, unfortunately I wasn't able to construct a test that triggered the bug in Avro.  I can make it happen reliably in my own application (which I'm not able to publish here), but for some reason I couldn't get it to happen with the Avro unit test.  After spending hours on it, I had to settle for a test that verifies that the hang _doesn't_ happen.  If you'd rather see a test that fails before the patch is applied, I can take another stab at it.
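A test of that shape, asserting that the call completes rather than reproducing the hang, can be sketched with a watchdog join (illustrative names only, not the actual unit test in the patch):

```java
public class NoHangCheck {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for "execute another RPC right after the server shut down".
        // In the real test this would be a SpecificRequestor call that is
        // expected to fail fast with an IOException instead of blocking.
        Runnable rpcCall = () -> { /* RPC attempt goes here */ };

        Thread t = new Thread(rpcCall);
        t.start();
        t.join(2000);  // give the call two seconds to finish

        boolean hung = t.isAlive();
        System.out.println("call hung: " + hung);
        if (hung) {
            t.interrupt();
            throw new AssertionError("RPC hung instead of failing fast");
        }
    }
}
```

Such a test only guards against a regression of the hang; it does not prove the race was hit, which matches the limitation described above.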
                

[jira] [Commented] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "Doug Cutting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202707#comment-13202707 ] 

Doug Cutting commented on AVRO-1013:
------------------------------------

That's okay.  +1
                

[jira] [Commented] (AVRO-1013) NettyTransceiver can hang after server restart

Posted by "James Baldassari (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/AVRO-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195894#comment-13195894 ] 

James Baldassari commented on AVRO-1013:
----------------------------------------

The second change I described, to NettyTransceiver#isConnected(), actually broke a bunch of unit tests, so I'm just going to leave that method unchanged.
                