Posted to user@cassandra.apache.org by Léo FERLIN SUTTON <lf...@mailjet.com.INVALID> on 2019/02/06 17:15:00 UTC

Bootstrap keeps failing

Hello!

I am having a recurring problem when trying to bootstrap a few new nodes.

Some general info:

   - I am running Cassandra 3.0.17
   - We have about 30 nodes in our cluster
   - All healthy nodes have between 60% and 90% disk usage on
   /var/lib/cassandra

So I create a new node and let auto_bootstrap do its job. After a few days
the bootstrapping node stops streaming new data but is still not a member
of the cluster.

`nodetool status` says the node is still joining.

When this happens I run `nodetool bootstrap resume`. This usually ends in
one of two ways:

   1. The node fills up to 100% disk space and crashes.
   2. The bootstrap resume finishes with errors.

When I look at `nodetool netstats -H`, it looks like `bootstrap resume`
does not resume but restarts a full transfer of all data from every node.

This is the output I get from `nodetool bootstrap resume`:

> [2019-02-06 01:39:14,369] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-225-big-Data.db (progress: 2113%)
> [2019-02-06 01:39:16,821] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-88-big-Data.db (progress: 2113%)
> [2019-02-06 01:39:17,003] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-89-big-Data.db (progress: 2113%)
> [2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)
> [2019-02-06 01:41:15,160] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-220-big-Data.db (progress: 2113%)
> [2019-02-06 01:42:02,864] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-226-big-Data.db (progress: 2113%)
> [2019-02-06 01:42:09,284] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-227-big-Data.db (progress: 2113%)
> [2019-02-06 01:42:10,522] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-228-big-Data.db (progress: 2113%)
> [2019-02-06 01:42:10,622] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-229-big-Data.db (progress: 2113%)
> [2019-02-06 01:42:11,925] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-90-big-Data.db (progress: 2114%)
> [2019-02-06 01:42:14,887] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-91-big-Data.db (progress: 2114%)
> [2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)
> [2019-02-06 01:42:14,980] Stream failed
> [2019-02-06 01:42:14,982] Error during bootstrap: Stream failed
> [2019-02-06 01:42:14,982] Resume bootstrap complete
>
>
The bootstrap `progress` goes way over 100% and eventually fails.
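
For anyone who wants to watch for this in a captured log, here is a minimal Python sketch that pulls the `(progress: N%)` values out of the resume output and flags anything above 100%. It assumes each log entry sits on a single line; the function names and the 100% limit are my own choices for illustration, not anything provided by nodetool.

```python
import re

# Matches the "(progress: N%)" suffix seen in the output quoted above.
PROGRESS_RE = re.compile(r"\(progress:\s*(\d+)%\)")


def progress_values(lines):
    """Return every progress percentage found in the given output lines."""
    values = []
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            values.append(int(match.group(1)))
    return values


def over_reported(lines, limit=100):
    """True if any reported progress exceeds `limit` percent."""
    return any(value > limit for value in progress_values(lines))


# Three lines copied from the output quoted above.
SAMPLE = [
    "[2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)",
    "[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)",
    "[2019-02-06 01:42:14,980] Stream failed",
]

if __name__ == "__main__":
    print(progress_values(SAMPLE), over_reported(SAMPLE))
```

Lines without a progress suffix, such as `Stream failed`, are simply skipped.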


Right now I have a node with this output from `nodetool status`:
`UJ  10.16.XX.YYY  2.93 TB    256          ?                 5788f061-a3c0-46af-b712-ebeecd397bf7  c`

It is almost filled with data, yet if I look at `nodetool netstats`:

>         Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
>         Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
>         Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
>         Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
>         Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
>         Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
>         Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
>         Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total


It is trying to pull all the data again.
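
To put a number on that, here is a rough Python sketch that sums the "Receiving" lines from `nodetool netstats`. It assumes each session summary is on one line and that the units are binary multiples (only GB and MB appear in the output above; KB and TB are included as a guess). The function name is my own.

```python
import re

# One "Receiving ..." summary line per streaming session, as quoted above.
LINE_RE = re.compile(
    r"Receiving (\d+) files, ([\d.]+) (KB|MB|GB|TB) total\. "
    r"Already received (\d+) files, ([\d.]+) (KB|MB|GB|TB) total"
)

# Assumption: binary multiples; only GB and MB occur in the quoted output.
GB_PER_UNIT = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def summarize(lines):
    """Return (expected_gb, received_gb) summed over all matching lines."""
    expected_gb = 0.0
    received_gb = 0.0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue
        expected_gb += float(match.group(2)) * GB_PER_UNIT[match.group(3)]
        received_gb += float(match.group(5)) * GB_PER_UNIT[match.group(6)]
    return expected_gb, received_gb


NETSTATS = """\
Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total
""".splitlines()

if __name__ == "__main__":
    expected, received = summarize(NETSTATS)
    print(f"expected {expected:.2f} GB, received {received:.2f} GB")
```

On the eight sessions above this adds up to roughly 2449 GB still expected against about 3 GB received, on a node that already reports 2.93 TB of load.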

Am I missing something about the way `nodetool bootstrap resume` is
supposed to be used?

Regards,

Leo

Re: [EXTERNAL] Re: Bootstrap keeps failing

Posted by Léo FERLIN SUTTON <lf...@mailjet.com.INVALID>.
Thank you for the recommendation.

We are already using DataStax's recommended settings for tcp_keepalive.

Regards,

Leo

On Thu, Feb 7, 2019 at 5:49 PM Durity, Sean R <SE...@homedepot.com>
wrote:

> I have seen unreliable streaming (streaming that doesn’t finish) because
> of TCP timeouts from firewalls or switches. The default tcp_keepalive
> kernel parameters are usually not tuned for that. See
> https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/idleFirewallLinux.html
> for more details. These “remote” timeouts are difficult to detect or prove
> if you don’t have access to the intermediate network equipment.
>
>
>
> Sean Durity

RE: [EXTERNAL] Re: Bootstrap keeps failing

Posted by "Durity, Sean R" <SE...@homedepot.com>.
I have seen unreliable streaming (streaming that doesn’t finish) because of TCP timeouts from firewalls or switches. The default tcp_keepalive kernel parameters are usually not tuned for that. See https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/idleFirewallLinux.html for more details. These “remote” timeouts are difficult to detect or prove if you don’t have access to the intermediate network equipment.
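
As a rough illustration of why the defaults matter, the Python sketch below compares keepalive settings against an idle timeout on an intermediate device. The "suggested" values are what I recall from the linked DataStax page and should be verified there; the Linux defaults are the usual kernel ones; the 3600-second firewall timeout is purely hypothetical. The function names are mine.

```python
from pathlib import Path

# Assumption: values recalled from the linked DataStax page; verify them there.
SUGGESTED = {
    "tcp_keepalive_time": 60,    # idle seconds before the first probe
    "tcp_keepalive_probes": 3,   # unanswered probes before giving up
    "tcp_keepalive_intvl": 10,   # seconds between probes
}

# Usual Linux kernel defaults.
LINUX_DEFAULTS = {
    "tcp_keepalive_time": 7200,
    "tcp_keepalive_probes": 9,
    "tcp_keepalive_intvl": 75,
}


def read_keepalive(sysctl_dir="/proc/sys/net/ipv4"):
    """Read the three keepalive settings from the given directory."""
    base = Path(sysctl_dir)
    return {name: int((base / name).read_text().strip()) for name in SUGGESTED}


def survives_idle_timeout(settings, idle_timeout_s):
    """True if the first probe is sent before the device's idle timeout."""
    return settings["tcp_keepalive_time"] < idle_timeout_s


def dead_peer_detection_s(settings):
    """Worst-case seconds to notice a dead peer on an idle connection."""
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_probes"] * settings["tcp_keepalive_intvl"])


if __name__ == "__main__":
    # 3600 s is a hypothetical firewall idle timeout, not a value from this thread.
    for label, settings in (("defaults", LINUX_DEFAULTS), ("suggested", SUGGESTED)):
        print(label, survives_idle_timeout(settings, 3600),
              dead_peer_detection_s(settings))
```

On a Linux host, `read_keepalive()` returns the live values, which can be passed to the same two functions.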

Sean Durity
From: Léo FERLIN SUTTON <lf...@mailjet.com.INVALID>
Sent: Thursday, February 07, 2019 10:26 AM
To: user@cassandra.apache.org; dinesh.joshi@yahoo.com
Subject: [EXTERNAL] Re: Bootstrap keeps failing

Hello!

Thank you for your answers.

I have tried, multiple times, to start bootstrapping from scratch. I often hit the same problem (on other nodes as well), but sometimes it works and I can move on to another node.

I have attached a jstack dump and some logs.

Our node was shut down at around 97% disk space used.
I turned it back on and it started the bootstrap process again.

The log file and the thread dump are both from this attempt.

A small warning: I have somewhat anonymised the log files, so there may be some inconsistencies.

Regards,

Leo

________________________________

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

Re: Bootstrap keeps failing

Posted by Léo FERLIN SUTTON <lf...@mailjet.com.INVALID>.
Hello!

Thank you for your answers.

I have tried, multiple times, to start bootstrapping from scratch. I
often hit the same problem (on other nodes as well), but sometimes it works
and I can move on to another node.

I have attached a jstack dump and some logs.

Our node was shut down at around 97% disk space used.
I turned it back on and it started the bootstrap process again.

The log file and the thread dump are both from this attempt.

A small warning: I have somewhat anonymised the log files, so there may be
some inconsistencies.

Regards,

Leo

On Thu, Feb 7, 2019 at 8:13 AM dinesh.joshi@yahoo.com.INVALID
<di...@yahoo.com.invalid> wrote:

> Would it be possible for you to take a thread dump & logs and share them?
>
> Dinesh

Re: Bootstrap keeps failing

Posted by "dinesh.joshi@yahoo.com.INVALID" <di...@yahoo.com.INVALID>.
Would it be possible for you to take a thread dump & logs and share them?
Dinesh 


RE: Bootstrap keeps failing

Posted by Kenneth Brotman <ke...@yahoo.com.INVALID>.
Not sure offhand why that is happening, but could you try bootstrapping that node from scratch again, or try a different new node?

 

Kenneth Brotman

 
