Posted to commits@cassandra.apache.org by "Benjamin Roth (JIRA)" <ji...@apache.org> on 2016/08/08 08:40:20 UTC

[jira] [Reopened] (CASSANDRA-12280) nodetool repair hangs

     [ https://issues.apache.org/jira/browse/CASSANDRA-12280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Roth reopened CASSANDRA-12280:
---------------------------------------
    Reproduced In: 3.7, 3.0.8, 3.9  (was: 3.0.8)

I encountered this issue again (and again and again).
Last time it even happened with the keyspace mentioned in the issue description, which contains exactly 6 records in table "dislike" and nothing else.
There are currently no reads or writes on that keyspace. Other keyspaces in the cluster are already in production, so the cluster itself is a bit busy but far from being overloaded.
We use reaper for queuing the repairs. The repair that hung was a parallel repair with a token range on the whole keyspace.
The repair cannot be cancelled via JMX (neither directly nor by reaper, which also uses JMX); the JMX call hangs as well. Only restarting all the nodes with a hanging repair helps.
I don't see any logs indicating a hard error like broken pipe, timeouts, ...
tpstats shows the hanging repairs, no compactions are ongoing or pending. netstats shows 1 or 2 pending messages all the time but it is hard to tell if they belong to the hanging repair.

To me this smells like a deadlock caused by a race condition. Could it be related to MVs? Maybe it happens when a base table and a related MV are repaired at the same time?

Sometimes I saw something like "Could not create snapshot" in the logs, but it is hard to tell whether that was a cause or an effect.

Are there any tools to dig deeper? More detailed logging? A way to get a stack trace of the repair thread? There are not that many ways for a thread to "hang": either it is waiting for IO or it is blocked on a lock. It should be quite easy to find out what's going on once we see the backtrace.
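
One way to get such a backtrace might be the JVM's standard Threading MXBean (the `jstack <pid>` tool from the JDK gives similar output). The following is only a minimal sketch, not something taken from Cassandra itself: the class name `RepairThreadDump`, the `host:port` argument and the "repair" name filter are assumptions made for illustration, and it assumes the node's JMX endpoint (port 7199 by default) is reachable without authentication. Whether this call still answers while the repair-related JMX calls hang has not been verified.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper, not part of Cassandra or nodetool.
public class RepairThreadDump {

    // Builds a text report: deadlock count plus the stack of every thread
    // whose name contains nameFilter (case-insensitive).
    static String dump(ThreadMXBean threads, String nameFilter) {
        StringBuilder out = new StringBuilder();

        // Ask the JVM for lock cycles; null means none were found.
        long[] deadlocked = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        out.append("deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length)
           .append('\n');

        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        String needle = nameFilter.toLowerCase();
        for (ThreadInfo info : infos) {
            if (info == null || !info.getThreadName().toLowerCase().contains(needle)) {
                continue;
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                // The thread is blocked or waiting on this lock.
                out.append(" waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    out.append(" held by \"").append(info.getLockOwnerName()).append('"');
                }
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // No arguments: inspect this JVM, just to show the output format.
            System.out.print(dump(ManagementFactory.getThreadMXBean(), ""));
            return;
        }
        // args[0] is host:port of the node's JMX endpoint (assumed unauthenticated).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean remote = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            // "repair" as the default filter is a guess at the thread naming.
            System.out.print(dump(remote, args.length > 1 ? args[1] : "repair"));
        }
    }
}
```

Run against a node with a hanging repair, e.g. `java RepairThreadDump <host>:7199 repair`, this would show whether the matching threads are BLOCKED or WAITING on a lock and which thread holds it, or whether they are RUNNABLE inside an IO call.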

> nodetool repair hangs
> ---------------------
>
>                 Key: CASSANDRA-12280
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12280
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Benjamin Roth
>
> nodetool repair hangs when repairing a whole keyspace; it does not hang when repairing table by table / MV by MV.
> Command executed (both variants make it hang):
> nodetool repair likes like dislike_by_source_mv like_by_contact_mv match_valid_mv like_out dislike match match_by_contact_mv like_valid_mv like_out_by_source_mv
> OR
> nodetool repair likes
> Logs:
> https://gist.github.com/brstgt/bf8b20fa1942d29ab60926ede7340b75
> Nodetool output:
> https://gist.github.com/brstgt/3aa73662da4b0190630ac1aad6c90a6f
> Schema:
> https://gist.github.com/brstgt/3fd59e0166f86f8065085532e3638097


