Posted to commits@cassandra.apache.org by "vin01 (JIRA)" <ji...@apache.org> on 2016/06/19 14:32:05 UTC

[jira] [Comment Edited] (CASSANDRA-11845) Hanging repair in cassandra 2.2.4

    [ https://issues.apache.org/jira/browse/CASSANDRA-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15338535#comment-15338535 ] 

vin01 edited comment on CASSANDRA-11845 at 6/19/16 2:31 PM:
------------------------------------------------------------

After getting the following exception:

ERROR [STREAM-OUT-/NODE_IN_DC_1] 2016-06-19 08:36:10,187 StreamSession.java:524 - [Stream #80b94bf0-3611-11e6-a89a-87602fd2948b] Streaming error occurred
java.net.SocketException: Connection reset
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113) ~[na:1.8.0_72]
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153) ~[na:1.8.0_72]

I went on to check network logs around that time.

At the ASA firewall between the DCs, I can see a lot of deny messages for some packets:

%ASA-6-106015: Deny TCP (no connection) from [NODE_IN_DC_2]/7003 to [NODE_IN_DC_1]/45573 flags ACK  on interface inside

I think that's the reason for the failure.

That deny message basically indicates an idle timeout: the firewall had already removed the connection from its connection table, so the ACK sent afterwards was denied.

Does Cassandra have something to handle such cases? Some kind of retry mechanism?
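For what it's worth, a workaround often suggested for firewalls that silently drop idle connections is to have the kernel send TCP keepalive probes more frequently than the firewall's idle timeout, so the connection stays in the firewall's connection table. The values below are only an illustration, not a tested recommendation: they assume a firewall idle timeout of roughly an hour, and they only help for sockets that have SO_KEEPALIVE enabled.

```shell
# Illustrative Linux sysctl settings (assumed ~60 min firewall idle timeout);
# these are example values, tune them to your own firewall's configuration.
sysctl -w net.ipv4.tcp_keepalive_time=300   # first probe after 5 min of idle
sysctl -w net.ipv4.tcp_keepalive_intvl=60   # re-probe every 60 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=5   # drop the connection after 5 failed probes
```

To persist across reboots, the same keys would go in /etc/sysctl.conf. Alternatively, raising the firewall's idle timeout for port 7000/7001 traffic between the DCs attacks the same problem from the other side.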


was (Author: vin01):
At the ASA firewall between the DCs, I can see a lot of deny messages for some packets:

%ASA-6-106015: Deny TCP (no connection) from [NODE_IN_DC_2]/7003 to [NODE_IN_DC_1]/45573 flags ACK  on interface inside

I think that's the reason for the failure.

That deny message basically indicates an idle timeout: the firewall had already removed the connection from its connection table, so the ACK sent afterwards was denied.

Does Cassandra have something to handle such cases? Some kind of retry mechanism?

> Hanging repair in cassandra 2.2.4
> ---------------------------------
>
>                 Key: CASSANDRA-11845
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11845
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Centos 6
>            Reporter: vin01
>            Priority: Minor
>         Attachments: cassandra-2.2.4.error.log
>
>
> So after increasing the streaming_timeout_in_ms value to 3 hours, I was able to avoid the socketTimeout errors I was getting earlier (https://issues.apache.org/jira/browse/CASSANDRA-11826), but now the issue is that the repair just stays stuck.
> Current status:
> [2016-05-19 05:52:50,835] Repair session a0e590e1-1d99-11e6-9d63-b717b380ffdd for range (-3309358208555432808,-3279958773585646585] finished (progress: 54%)
> [2016-05-19 05:53:09,446] Repair session a0e590e3-1d99-11e6-9d63-b717b380ffdd for range (8149151263857514385,8181801084802729407] finished (progress: 55%)
> [2016-05-19 05:53:13,808] Repair session a0e5b7f1-1d99-11e6-9d63-b717b380ffdd for range (3372779397996730299,3381236471688156773] finished (progress: 55%)
> [2016-05-19 05:53:27,543] Repair session a0e5b7f3-1d99-11e6-9d63-b717b380ffdd for range (-4182952858113330342,-4157904914928848809] finished (progress: 55%)
> [2016-05-19 05:53:41,128] Repair session a0e5df00-1d99-11e6-9d63-b717b380ffdd for range (6499366179019889198,6523760493740195344] finished (progress: 55%)
> And it's 10:46:25 now; it has been stuck right there for almost 5 hours.
> Earlier I could see the repair session progressing in system.log, but no logs are coming in right now; all I get is the regular index summary redistribution logs.
> The last repair-related entries I saw in the logs:
> INFO  [RepairJobTask:5] 2016-05-19 05:53:41,125 RepairJob.java:152 - [repair #a0e5df00-1d99-11e6-9d63-b717b380ffdd] TABLE_NAME is fully synced
> INFO  [RepairJobTask:5] 2016-05-19 05:53:41,126 RepairSession.java:279 - [repair #a0e5df00-1d99-11e6-9d63-b717b380ffdd] Session completed successfully
> INFO  [RepairJobTask:5] 2016-05-19 05:53:41,126 RepairRunnable.java:232 - Repair session a0e5df00-1d99-11e6-9d63-b717b380ffdd for range (6499366179019889198,6523760493740195344] finished
> It's an incremental repair, and in the "nodetool netstats" output I can see entries like:
> Repair e3055fb0-1d9d-11e6-9d63-b717b380ffdd
>     /Node-2
>         Receiving 8 files, 1093461 bytes total. Already received 8 files, 1093461 bytes total
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80872-big-Data.db 399475/399475 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80879-big-Data.db 53809/53809 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80878-big-Data.db 89955/89955 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80881-big-Data.db 168790/168790 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80886-big-Data.db 107785/107785 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80880-big-Data.db 52889/52889 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80884-big-Data.db 148882/148882 bytes(100%) received from idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-80883-big-Data.db 71876/71876 bytes(100%) received from idx:0/Node-2
>         Sending 5 files, 863321 bytes total. Already sent 5 files, 863321 bytes total
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73168-big-Data.db 161895/161895 bytes(100%) sent to idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-72604-big-Data.db 399865/399865 bytes(100%) sent to idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73147-big-Data.db 149066/149066 bytes(100%) sent to idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-72682-big-Data.db 126000/126000 bytes(100%) sent to idx:0/Node-2
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73173-big-Data.db 26495/26495 bytes(100%) sent to idx:0/Node-2
> Repair c0c8af20-1d9c-11e6-9d63-b717b380ffdd
>     /Node-3
>         Receiving 11 files, 13896288 bytes total. Already received 11 files, 13896288 bytes total
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79186-big-Data.db 1598874/1598874 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79196-big-Data.db 736365/736365 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79197-big-Data.db 326558/326558 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79187-big-Data.db 1484827/1484827 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79180-big-Data.db 393636/393636 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79184-big-Data.db 825459/825459 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79188-big-Data.db 3568782/3568782 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79182-big-Data.db 271222/271222 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79193-big-Data.db 4315497/4315497 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79183-big-Data.db 19775/19775 bytes(100%) received from idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/tmp-la-79192-big-Data.db 355293/355293 bytes(100%) received from idx:0/Node-3
>         Sending 5 files, 9444101 bytes total. Already sent 5 files, 9444101 bytes total
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73168-big-Data.db 1796825/1796825 bytes(100%) sent to idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-72604-big-Data.db 4549996/4549996 bytes(100%) sent to idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73147-big-Data.db 1658881/1658881 bytes(100%) sent to idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-72682-big-Data.db 1418335/1418335 bytes(100%) sent to idx:0/Node-3
>             /data/cassandra/data/KEYSPACE_NAME/TABLE_NAME-01ad9750723e11e4bfe0d3887930a87c/la-73173-big-Data.db 20064/20064 bytes(100%) sent to idx:0/Node-3
> Read Repair Statistics:
> Attempted: 1142
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Large messages                  n/a         0            779
> Small messages                  n/a         0       14756609
> Gossip messages                 n/a         0         119647
> The last three pools ("Large messages", "Small messages" and "Gossip messages") keep changing: "Large messages" has incremented by 2 in the last 5 hours, and the other two change more frequently.
> I am unable to figure out whether the repair is still going on or stuck. If it's stuck, what should be my course of action to get that table repaired?
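One rough way to distinguish the two cases from `nodetool netstats` output like the above: if every file transfer shows 100% yet no new repair lines appear in system.log, the streams themselves have finished and the session is more likely hung than working. A small illustrative helper (the function name and classification logic are made up for this sketch, not part of any Cassandra tooling):

```shell
# Hypothetical helper: reads `nodetool netstats` output on stdin and reports
# whether any file transfer is still below 100%. Prints "in-progress" if at
# least one transfer is incomplete, otherwise "all-complete".
streams_in_progress() {
  # First grep keeps only the per-file transfer lines (they contain "bytes(NN%)");
  # second grep checks whether any of those lines is NOT at 100%.
  if grep -E 'bytes\([0-9]+%\)' | grep -qv 'bytes(100%)'; then
    echo "in-progress"
  else
    echo "all-complete"
  fi
}
```

Usage would be `nodetool netstats | streams_in_progress`. If everything is at 100% and the repair is still silent, the usual remedy on 2.2 is to abandon the hung session (e.g. by restarting the affected node) and re-run the repair, since 2.2 has no command to cancel an individual repair session.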



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)