Posted to user@cassandra.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2014/10/20 14:45:59 UTC

Repair hangs, seems to be stuck somehow

Hi,

Using Cassandra 1.2.18, we are experiencing an issue in our 2 DC
(EC2MultiRegionSnitch) cluster.

We have 2 DCs and I saw some weird* inconsistencies between them. I
tried to run repair on all the nodes of both DCs (we tried running several
repairs at the same time and also rolling repairs, with and without the
-pr option). It has been running for days (the last run started 3 days ago
on various machines). It seems to hang, since I can't see any validation
compaction or any streams running, though I don't see any error either.
The CF I am trying to repair right now is 350 MB (per node), so I am quite
sure it shouldn't take that long. Repairs of other CFs also get stuck.

The behaviour is quite strange since it seems to work at the start (I see
this kind of log: "INFO [AntiEntropyStage:1] 2014-10-18 06:01:58,991
AntiEntropyService.java (line 213) [repair
#44563cb0-568c-11e4-83c0-4dae0987c5d6] Received merkle tree for mytable
from /xxx.xxx.xxx.xxx", and I see some streams). But then the load on the
nodes goes down, the streams finish and there is no more validation. When I
check my data it appears I still have discrepancies, and the "nodetool
repair" command does not return.

I know that 2.1 fixes all this. We are going to migrate to C* 2.0 soon
(asap) and then to 2.1, but we first need to run some tests, which will
take us some time. Is repair officially broken on 1.2.18? Is there any
known workaround or solution to get data repaired on this version?

Any insight is very welcome. And if you need more information, let me know.

Alain

*That's weird since nodetool rebuild worked just fine on all the nodes
joining while building the new DC (except one that got stuck somehow, but
since I have RF 3 and CL LOCAL_QUORUM, I should see the exact same result
on my 2 DCs for a past value, not updated after the new DC joined the
cluster). Any idea why I have those discrepancies in the first place?

Re: Repair hangs, seems to be stuck somehow

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Using Cassandra 1.2.18, we are experiencing an issue in our 2 DC
> (EC2MultiRegionSnitch) cluster.
>
> We have 2 DCs and I saw some weird* inconsistencies between them. I
> tried to run repair on all the nodes of both DCs (we tried running several
> repairs at the same time and also rolling repairs, with and without the
> -pr option). It has been running for days (the last run started 3 days ago
> on various machines). It seems to hang, since I can't see any validation
> compaction or any streams running, though I don't see any error either.
> The CF I am trying to repair right now is 350 MB (per node), so I am quite
> sure it shouldn't take that long. Repairs of other CFs also get stuck.
>

This is a long-standing issue which is unfortunately becoming a FAQ. Yes,
repair is broken in all versions of Cassandra up to at least 2.0.10;
hopefully the latest streaming rewrite will finally fix it. If you are
really overprovisioned and on real hardware and network and SSD, it might
work sometimes.

Here are some related JIRA tickets...

https://issues.apache.org/jira/browse/CASSANDRA-3486 - nodetool command to
stop repair
https://issues.apache.org/jira/browse/CASSANDRA-7904 - good entry point
into web of 2.0 era repair bugs

and last but not least the inappropriately hostile but accurate...
https://issues.apache.org/jira/browse/CASSANDRA-5396

=Rob

Re: Repair hangs, seems to be stuck somehow

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

I finally had to decommission this annoying node that was breaking repairs
again.

So far so good. It seems that doing so solved the issue.

Hope this will help some people out there.

Alain

2014-10-20 22:59 GMT+02:00 Alain RODRIGUEZ <ar...@gmail.com>:

> Hi guys.
>
> It seems that there were 2 streams hanging to one node. Restarting that
> node seems to have solved my issue; repairs are now running.
> Waiting to see if they complete.
>
> "Try repairing only one CF at a time, starting with the smallest ones
> and/or the ones whose data you care about the most?"
>
> Thanks Robert, I believe this is a good idea but I was doing it already.
>
> "If you are really overprovisioned and on real hardware and network and
> SSD, it might work sometimes."
>
> I am on AWS and have run m1.small to m1.xlarge from Cassandra 0.8 to
> 1.2.18; this is the first time a repair has hung for me. At least it is
> the first time I have noticed it. Maybe I was lucky somehow.
>
> Anyway, thanks for helping once again.
>
> Alain
>
> 2014-10-20 19:33 GMT+02:00 Robert Coli <rc...@eventbrite.com>:
>
>> On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> I know that 2.1 fixes all this. We are going to migrate to C* 2.0 soon
>>> (asap) and then to 2.1, but we first need to run some tests, which will
>>> take us some time. Is repair officially broken on 1.2.18? Is there any
>>> known workaround or solution to get data repaired on this version?
>>>
>>
>> One more thing :
>>
>> Try repairing only one CF at a time, starting with the smallest ones
>> and/or the ones whose data you care about the most?
>>
>> =Rob
>>
>>
>
>

Re: Repair hangs, seems to be stuck somehow

Posted by Alain RODRIGUEZ <ar...@gmail.com>.

Hi guys.

It seems that there were 2 streams hanging to one node. Restarting that
node seems to have solved my issue; repairs are now running.
Waiting to see if they complete.

"Try repairing only one CF at a time, starting with the smallest ones
and/or the ones whose data you care about the most?"

Thanks Robert, I believe this is a good idea but I was doing it already.

"If you are really overprovisioned and on real hardware and network and
SSD, it might work sometimes."

I am on AWS and have run m1.small to m1.xlarge from Cassandra 0.8 to
1.2.18; this is the first time a repair has hung for me. At least it is
the first time I have noticed it. Maybe I was lucky somehow.

Anyway, thanks for helping once again.

Alain

2014-10-20 19:33 GMT+02:00 Robert Coli <rc...@eventbrite.com>:

> On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> I know that 2.1 fixes all this. We are going to migrate to C* 2.0 soon
>> (asap) and then to 2.1, but we first need to run some tests, which will
>> take us some time. Is repair officially broken on 1.2.18? Is there any
>> known workaround or solution to get data repaired on this version?
>>
>
> One more thing :
>
> Try repairing only one CF at a time, starting with the smallest ones
> and/or the ones whose data you care about the most?
>
> =Rob
>
>

Re: Repair hangs, seems to be stuck somehow

Posted by Robert Coli <rc...@eventbrite.com>.

On Mon, Oct 20, 2014 at 5:45 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> I know that 2.1 fixes all this. We are going to migrate to C* 2.0 soon
> (asap) and then to 2.1, but we first need to run some tests, which will
> take us some time. Is repair officially broken on 1.2.18? Is there any
> known workaround or solution to get data repaired on this version?
>

One more thing :

Try repairing only one CF at a time, starting with the smallest ones and/or
the ones whose data you care about the most?

=Rob
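
For readers who want to script the "one CF at a time, smallest first"
suggestion, here is a minimal sketch in Python. It only builds the
"nodetool repair" command lines and does not run them. The keyspace name
"mykeyspace", the CF names "users" and "events", and all sizes except the
350 MB mentioned in the thread are made up for illustration; plan_repairs
is a hypothetical helper, not part of Cassandra.

```python
from typing import Dict, List


def plan_repairs(keyspace: str,
                 cf_sizes: Dict[str, int],
                 primary_range: bool = True) -> List[List[str]]:
    """Return one `nodetool repair` command per column family, smallest first.

    cf_sizes maps a column family name to its approximate size (any unit,
    as long as it is the same for every entry).
    """
    commands = []
    # Sort by size, then by name so that ties come out in a stable order.
    for cf, _size in sorted(cf_sizes.items(), key=lambda item: (item[1], item[0])):
        cmd = ["nodetool", "repair"]
        if primary_range:
            cmd.append("-pr")  # repair only this node's primary range
        cmd += [keyspace, cf]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Hypothetical keyspace and sizes in MB; "mytable" is the CF from the log.
    sizes = {"mytable": 350, "users": 40, "events": 1200}
    for cmd in plan_repairs("mykeyspace", sizes):
        print(" ".join(cmd))
```

Each command could then be run in turn, for example with
subprocess.run(cmd, check=True, timeout=...), so that a hang is confined to
a single CF and is noticed instead of blocking for days.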