Posted to user@cassandra.apache.org by Paul Pollack <pa...@klaviyo.com> on 2017/08/31 01:39:04 UTC

Cassandra 3.7 repair error messages

Hi,

I'm trying to run a repair on a node in my Cassandra cluster, version 3.7, and
was hoping someone might be able to shed light on an error message that keeps
cropping up.

I started the repair on the node after discovering that it had somehow become
partitioned from the rest of the cluster: nodetool status on all other nodes
showed it as DN, and nodetool status on the node itself showed all other
nodes as DN. After restarting the Cassandra daemon the node seemed to re-join
the cluster just fine, so I began the repair.
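
In case it's useful, something like the sketch below is one way to spot DN
nodes programmatically instead of eyeballing nodetool status on each host.
It's a hypothetical helper, not what we actually run: it assumes nodetool is
on the PATH and the default tabular output, where each node row starts with a
two-letter status/state code such as UN or DN.

# check_dn.py -- hypothetical helper: list nodes that nodetool status reports as down.
# Assumes nodetool is on the PATH and the default tabular output, where each node
# row starts with a two-letter status/state code (U=up, D=down; N/L/J/M for state).
import subprocess

def down_nodes():
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True, check=True).stdout
    down = []
    for line in out.splitlines():
        parts = line.split()
        # Node rows look like: "DN  20.0.122.204  512.1 GiB  256  ...  rack1"
        if parts and parts[0] in ("DN", "DL", "DJ", "DM"):
            down.append(parts[1])
    return down

if __name__ == "__main__":
    for addr in down_nodes():
        print("marked down:", addr)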

The repair has been running for about 33 hours (first incremental repair on
this cluster), and every so often I'll see a line like this:

[2017-08-31 00:18:16,300] Repair session f7ae4e71-8ce3-11e7-b466-79eba0383e4f
for range [(-5606588017314999649,-5604469721630340065],
(9047587767449433379,9047652965163017217]] failed with error
Endpoint /20.0.122.204 died (progress: 9%)

Every one of these lines refers to the same node, 20.0.122.204.
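
To see at a glance which endpoints the failures point at, and how often, a
quick tally over the captured repair output works. Here is a rough sketch,
assuming the nodetool repair output was redirected to a file and that the
failure lines follow the "failed with error Endpoint /<ip> died" wording
shown above:

# tally_repair_failures.py -- rough sketch: count failed repair sessions per endpoint.
# Assumes the nodetool repair output was captured to a file and that failures are
# reported with the phrase "failed with error Endpoint /<ip> died" as shown above.
import re
import sys
from collections import Counter

# \s* between "/" and the IP tolerates line wrapping in the captured output.
FAIL_RE = re.compile(r"failed with error Endpoint\s*/\s*(\d{1,3}(?:\.\d{1,3}){3}) died")

def tally(path):
    with open(path) as f:
        text = f.read()
    return Counter(FAIL_RE.findall(text))

if __name__ == "__main__":
    for ip, count in tally(sys.argv[1]).most_common():
        print(f"{ip}: {count} failed session(s)")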

I'm mostly looking for guidance here. Do these errors mean that the entire
repair will be worthless, or only the portion covering token ranges shared by
these two nodes? Is it normal to see error messages of this nature, and for a
repair not to terminate?

Thanks,
Paul

Re: Cassandra 3.7 repair error messages

Posted by Paul Pollack <pa...@klaviyo.com>.
Thanks Erick, and sorry it took me so long to respond; I had to turn my
attention to other things. It definitely looks like there had been some
network blips going on with that node for a while before we saw it marked
down from every other node's perspective. Additionally, my original comment
that all of the failure messages referred to the same node was incorrect:
it seems that every few hours the repair would start logging failure messages
for other nodes in turn.

I went through the logs on all of the other nodes that were reported failed
from .204's perspective and found that they had all failed to create a Merkle
tree. We decided to set the consistency level for reads on this cluster to
QUORUM, which has at least prevented any data inconsistencies and, as far as
we can tell, has caused no noticeable performance loss.
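
For anyone curious, consistency level is a per-request (client-side) setting,
so the change was made in the application rather than on the nodes. As an
illustration only -- not necessarily the driver or names we actually use --
defaulting reads to QUORUM with the DataStax Python driver looks roughly like
this; the contact point, keyspace, and table names are made up:

# quorum_reads.py -- illustrative sketch only: default reads to QUORUM with the
# DataStax Python driver. The contact point, keyspace, and table names below are
# made-up placeholders, not the actual cluster's.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

profile = ExecutionProfile(consistency_level=ConsistencyLevel.QUORUM)
cluster = Cluster(["20.0.122.10"],  # placeholder contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("my_keyspace")  # hypothetical keyspace

# With QUORUM, a read needs a majority of replicas to respond, so a single
# replica that missed writes while partitioned cannot serve stale data alone.
rows = session.execute("SELECT id FROM my_table LIMIT 10")  # hypothetical table
for row in rows:
    print(row.id)
cluster.shutdown()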

To answer your last question, I did once successfully run a repair on a
different node; it took about 12 hours.

I think that before I dig further into why this repair could not run to
completion, I have to address some other issues with the cluster -- namely
that we're hitting the Amazon EBS throughput cap on the data volumes for our
nodes, which is causing our disk queue length to grow and cluster-wide
throughput to tank.
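
As an aside, a cheap way to keep an eye on that queue depth without a full
monitoring stack is to sample /proc/diskstats. The sketch below is
hypothetical: it assumes the standard /proc/diskstats field layout and uses a
made-up device name, so adjust both for your own hosts.

# avg_queue_size.py -- hypothetical sketch: approximate iostat's average queue size
# for one block device by sampling the weighted-time-in-queue counter (the 14th
# field of /proc/diskstats, milliseconds of weighted time spent doing I/O).
import sys
import time

def weighted_io_ms(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[13])
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

def avg_queue_size(device, interval_s=5.0):
    start = weighted_io_ms(device)
    time.sleep(interval_s)
    end = weighted_io_ms(device)
    # Average number of in-flight/queued requests over the sampling interval.
    return (end - start) / (interval_s * 1000.0)

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "xvdf"  # made-up device name
    print(f"{dev} avg queue size: {avg_queue_size(dev):.2f}")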

Thanks again for your help,
Paul

On Wed, Aug 30, 2017 at 9:54 PM, Erick Ramirez <fl...@gmail.com> wrote:

> No, it isn't normal for sessions to fail, and you will need to investigate.
> You need to review the logs on node .204 to determine why the session
> failed. For example, did it time out because of a very large SSTable? Or did
> the connection get truncated after a while?
>
> You will need to address the cause of those failures. It could be external
> to the nodes, e.g. a firewall closing the socket, so you might need to
> configure TCP keepalive. 33 hours sounds like a really long time. Have you
> successfully run a repair on this cluster before?
>
> On Thu, Aug 31, 2017 at 11:39 AM, Paul Pollack <pa...@klaviyo.com>
> wrote:
>
>> Hi,
>>
>> I'm trying to run a repair on a node in my Cassandra cluster, version 3.7,
>> and was hoping someone might be able to shed light on an error message that
>> keeps cropping up.
>>
>> I started the repair on the node after discovering that it had somehow become
>> partitioned from the rest of the cluster: nodetool status on all other nodes
>> showed it as DN, and nodetool status on the node itself showed all other
>> nodes as DN. After restarting the Cassandra daemon the node seemed to re-join
>> the cluster just fine, so I began the repair.
>>
>> The repair has been running for about 33 hours (first incremental repair
>> on this cluster), and every so often I'll see a line like this:
>>
>> [2017-08-31 00:18:16,300] Repair session f7ae4e71-8ce3-11e7-b466-79eba0383e4f
>> for range [(-5606588017314999649,-5604469721630340065],
>> (9047587767449433379,9047652965163017217]] failed with error
>> Endpoint /20.0.122.204 died (progress: 9%)
>>
>> Every one of these lines refers to the same node, 20.0.122.204.
>>
>> I'm mostly looking for guidance here. Do these errors mean that the entire
>> repair will be worthless, or only the portion covering token ranges shared
>> by these two nodes? Is it normal to see error messages of this nature, and
>> for a repair not to terminate?
>>
>> Thanks,
>> Paul
>>
>
>

Re: Cassandra 3.7 repair error messages

Posted by Erick Ramirez <fl...@gmail.com>.
No, it isn't normal for sessions to fail, and you will need to investigate.
You need to review the logs on node .204 to determine why the session
failed. For example, did it time out because of a very large SSTable? Or did
the connection get truncated after a while?

You will need to address the cause of those failures. It could be external
to the nodes, e.g. a firewall closing the socket, so you might need to
configure TCP keepalive. 33 hours sounds like a really long time. Have you
successfully run a repair on this cluster before?
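
If it helps, a crude way to pull the likely-relevant lines out of system.log
on .204 is something like the sketch below. The log path and keyword list are
assumptions (the default /var/log/cassandra/system.log location, and keywords
that typically show up around repair, validation, and streaming problems);
adjust both to whatever your logs actually contain.

# scan_repair_errors.py -- rough sketch: surface repair-related WARN/ERROR lines
# from Cassandra's system.log. The log path and keyword list are assumptions;
# adjust them to match the actual node configuration and log contents.
LOG_PATH = "/var/log/cassandra/system.log"  # assumed default location
KEYWORDS = ("repair", "validation", "merkle", "stream", "socket")

def scan(path=LOG_PATH):
    hits = []
    with open(path, errors="replace") as f:
        for line in f:
            if ("ERROR" in line or "WARN" in line) and \
               any(k in line.lower() for k in KEYWORDS):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for line in scan():
        print(line)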

On Thu, Aug 31, 2017 at 11:39 AM, Paul Pollack <pa...@klaviyo.com>
wrote:

> Hi,
>
> I'm trying to run a repair on a node in my Cassandra cluster, version 3.7,
> and was hoping someone might be able to shed light on an error message that
> keeps cropping up.
>
> I started the repair on the node after discovering that it had somehow become
> partitioned from the rest of the cluster: nodetool status on all other nodes
> showed it as DN, and nodetool status on the node itself showed all other
> nodes as DN. After restarting the Cassandra daemon the node seemed to re-join
> the cluster just fine, so I began the repair.
>
> The repair has been running for about 33 hours (first incremental repair
> on this cluster), and every so often I'll see a line like this:
>
> [2017-08-31 00:18:16,300] Repair session f7ae4e71-8ce3-11e7-b466-79eba0383e4f
> for range [(-5606588017314999649,-5604469721630340065],
> (9047587767449433379,9047652965163017217]] failed with error
> Endpoint /20.0.122.204 died (progress: 9%)
>
> Every one of these lines refers to the same node, 20.0.122.204.
>
> I'm mostly looking for guidance here. Do these errors mean that the entire
> repair will be worthless, or only the portion covering token ranges shared
> by these two nodes? Is it normal to see error messages of this nature, and
> for a repair not to terminate?
>
> Thanks,
> Paul
>