Posted to commits@cassandra.apache.org by "Sammy Yu (JIRA)" <ji...@apache.org> on 2009/10/14 06:56:31 UTC

[jira] Created: (CASSANDRA-487) Message Serializer slows down/stops responding

Message Serializer slows down/stops responding
----------------------------------------------

                 Key: CASSANDRA-487
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-487
             Project: Cassandra
          Issue Type: Bug
    Affects Versions: 0.4
            Reporter: Sammy Yu


We ran into an issue where the MESSAGE-SERIALIZER-POOL piles up with tasks.
$ /usr/sbin/nodeprobe -host localhost tpstats
FILEUTILS-DELETE-POOL, pending tasks=0
MESSAGING-SERVICE-POOL, pending tasks=0
MESSAGE-SERIALIZER-POOL, pending tasks=10785714
RESPONSE-STAGE, pending tasks=0
BOOT-STRAPPER, pending tasks=0
ROW-READ-STAGE, pending tasks=0
MESSAGE-DESERIALIZER-POOL, pending tasks=0
GMFD, pending tasks=0
LB-TARGET, pending tasks=0
CONSISTENCY-MANAGER, pending tasks=0
ROW-MUTATION-STAGE, pending tasks=0
MESSAGE-STREAMING-POOL, pending tasks=0
LOAD-BALANCER-STAGE, pending tasks=0
MEMTABLE-FLUSHER-POOL, pending tasks=0

In the log, this seems to have happened when we stopped 2 of the other nodes in our cluster.  This node will then time out on any Thrift request.  Looking through the logs, we found the following two exceptions:
java.util.ConcurrentModificationException
        at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
        at java.util.AbstractList$Itr.next(AbstractList.java:349)
        at java.util.Collections.sort(Collections.java:120)
        at org.apache.cassandra.net.TcpConnectionManager.getLeastLoaded(TcpConnectionManager.java:108)
        at org.apache.cassandra.net.TcpConnectionManager.getConnection(TcpConnectionManager.java:71)
        at org.apache.cassandra.net.MessagingService.getConnection(MessagingService.java:306)
        at org.apache.cassandra.net.MessageSerializationTask.run(MessageSerializationTask.java:66)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

java.util.NoSuchElementException
        at java.util.AbstractList$Itr.next(AbstractList.java:350)
        at java.util.Collections.sort(Collections.java:120)
        at org.apache.cassandra.net.TcpConnectionManager.getLeastLoaded(TcpConnectionManager.java:108)
        at org.apache.cassandra.net.TcpConnectionManager.getConnection(TcpConnectionManager.java:71)
        at org.apache.cassandra.net.MessagingService.getConnection(MessagingService.java:306)
        at org.apache.cassandra.net.MessageSerializationTask.run(MessageSerializationTask.java:66)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

This appears to have happened on all 4 MESSAGE-SERIALIZER-POOL threads.
I will attach the complete log.




[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765398#action_12765398 ] 

Sammy Yu commented on CASSANDRA-487:
------------------------------------

I suspect this is because TcpConnectionManager.removeConnection is not wrapped in a lock.
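
A minimal, hypothetical reproduction of that interaction (illustrative only, not Cassandra's actual code): Collections.sort writes the sorted elements back through the list's iterator, and that loop is not guarded by the Vector's monitor, so an unlocked concurrent remove can invalidate the iterator mid-sort and raise exactly the exceptions above.

    import java.util.Collections;
    import java.util.List;
    import java.util.Vector;

    public class SortRemoveRace
    {
        public static void main(String[] args) throws InterruptedException
        {
            // Vector synchronizes individual calls, but Collections.sort
            // iterates the list without holding its monitor throughout.
            final List<Integer> connections = new Vector<Integer>();
            for (int i = 0; i < 100000; i++)
                connections.add(i);

            // Simulates removeConnection running with no lock around it.
            Thread remover = new Thread(new Runnable()
            {
                public void run()
                {
                    while (!connections.isEmpty())
                        connections.remove(0);
                }
            });
            remover.start();

            // Typically dies with ConcurrentModificationException or
            // NoSuchElementException, matching the two stack traces.
            Collections.sort(connections);
            remover.join();
        }
    }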




[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765631#action_12765631 ] 

Jonathan Ellis commented on CASSANDRA-487:
------------------------------------------

One other race here (not in my patch):

            if (contains(connection))
            {
                return;
            }

needs to be inside the lock_.
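
A hedged sketch of that check-then-act race, assuming a lock_ guarding the connection list as in the patch (everything beyond the snippet above is illustrative, not the actual TcpConnectionManager source):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative sketch only.
    class PoolSketch<C>
    {
        private final Lock lock_ = new ReentrantLock();
        private final List<C> connections_ = new ArrayList<C>();

        // Racy: two threads can both pass the contains() check before
        // either one adds, so the same connection gets pooled twice.
        void addToPoolRacy(C connection)
        {
            if (connections_.contains(connection))
                return;
            lock_.lock();
            try { connections_.add(connection); }
            finally { lock_.unlock(); }
        }

        // Fixed: the contains() check and the add() form one atomic
        // unit under lock_.
        void addToPool(C connection)
        {
            lock_.lock();
            try
            {
                if (!connections_.contains(connection))
                    connections_.add(connection);
            }
            finally { lock_.unlock(); }
        }
    }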



[jira] Updated: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-487:
-------------------------------------

    Fix Version/s:     (was: 0.5)
                   0.4



[jira] Updated: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated CASSANDRA-487:
-------------------------------

    Attachment: system-487.log.gz



[jira] Updated: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated CASSANDRA-487:
-------------------------------

    Fix Version/s: 0.4



[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765619#action_12765619 ] 

Sammy Yu commented on CASSANDRA-487:
------------------------------------

I've done some light testing, but we'll stress it out some more today.



[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jun Rao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765614#action_12765614 ] 

Jun Rao commented on CASSANDRA-487:
-----------------------------------

The patch looks good to me. Have you tried it in your deployment?



[jira] Updated: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated CASSANDRA-487:
-------------------------------

    Attachment: 0001-Added-locks-around-remove-operation-so-that-Concurre.patch

This looks like it exists in 0.5 as well.  I've wrapped the removeConnections method in locks.  We could avoid the locking by using CopyOnWriteArrayList instead, but that may be too expensive.
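
For contrast, a sketch of the CopyOnWriteArrayList alternative mentioned (names are illustrative): reads and iteration need no lock because they see an immutable snapshot, but every add or remove copies the whole backing array, which is the potential expense.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Illustrative sketch only.
    class CowPoolSketch<C extends Comparable<C>>
    {
        // Each mutation allocates and copies a fresh backing array.
        private final List<C> connections = new CopyOnWriteArrayList<C>();

        void add(C c)    { connections.add(c); }
        void remove(C c) { connections.remove(c); }

        // Lock-free read path: sort a private copy, since the list's
        // snapshot iterators do not support set() and so cannot be
        // sorted in place.
        C getLeastLoaded()
        {
            List<C> snapshot = new ArrayList<C>(connections);
            if (snapshot.isEmpty())
                return null;
            Collections.sort(snapshot);
            return snapshot.get(0);
        }
    }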




[jira] Issue Comment Edited: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765631#action_12765631 ] 

Jonathan Ellis edited comment on CASSANDRA-487 at 10/14/09 10:04 AM:
---------------------------------------------------------------------

One other race in addToPool (not in my patch):

            if (contains(connection))
            {
                return;
            }

needs to be inside the lock_.

      was (Author: jbellis):
    One other race here (not in my patch):

            if (contains(connection))
            {
                return;
            }

needs to be inside the lock_.
  


[jira] Issue Comment Edited: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765558#action_12765558 ] 

Sammy Yu edited comment on CASSANDRA-487 at 10/14/09 6:22 AM:
--------------------------------------------------------------

This looks like it exists in 0.5 as well.  The attached patch applies to 0.5.  I've wrapped the removeConnections method in locks.  We could avoid the locking by using CopyOnWriteArrayList instead, but that may be too expensive.


      was (Author: sammy.yu):
    This looks like it exists in 0.5 as well.  I've wrapped the removeConnections method in locks.  We could avoid the locking by using CopyOnWriteArrayList instead, but that may be too expensive.

  


[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765664#action_12765664 ] 

Jonathan Ellis commented on CASSANDRA-487:
------------------------------------------

Created CASSANDRA-488 for deeper fixes to the TcpConnectionManager area.



[jira] Commented: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765831#action_12765831 ] 

Sammy Yu commented on CASSANDRA-487:
------------------------------------

Tested jbellis' patch in a production-like environment, both in the normal operating state and while restarting multiple nodes.




[jira] Updated: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-487:
-------------------------------------

    Attachment: 487-lock-all-connection-ops.patch

The problem is that the original code tries to "cheat" and rely on Vector's built-in synchronization for one-line ops.  But as this exception shows, even that has problems, since compound operations like Collections.sort aren't synchronized as a whole (even though superficially a sort looks like a one-liner).

Here is a more general patch that turns the Vector into an ArrayList and always locks the collection explicitly, to remove the temptation to cheat like that.
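
A hedged sketch of that approach (illustrative names, not the patch itself): a plain ArrayList with no built-in synchronization, so every operation, including the compound sort-and-pick, has to go through one explicit critical section.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative sketch only.
    class LockedConnectionsSketch<C extends Comparable<C>>
    {
        private final Object lock_ = new Object();
        private final List<C> connections_ = new ArrayList<C>();

        void add(C c)    { synchronized (lock_) { connections_.add(c); } }
        void remove(C c) { synchronized (lock_) { connections_.remove(c); } }

        // The whole sort is one critical section.  Vector would only
        // have synchronized the individual calls inside Collections.sort,
        // which is what let the iterator see concurrent modifications.
        C getLeastLoaded()
        {
            synchronized (lock_)
            {
                if (connections_.isEmpty())
                    return null;
                Collections.sort(connections_);
                return connections_.get(0);
            }
        }
    }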



[jira] Assigned: (CASSANDRA-487) Message Serializer slows down/stops responding

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu reassigned CASSANDRA-487:
----------------------------------

    Assignee: Sammy Yu
