
Posted to user@cassandra.apache.org by Dmitry Simonov <di...@gmail.com> on 2018/06/30 05:05:03 UTC

cqlsh COPY ... TO ... doesn't work if one node down

Hello!

I have a Cassandra cluster with 5 nodes.
There is a (relatively small) keyspace X with RF=5.
One node went down.

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.82   253.64 MB  256     100.0%            839bef9d-79af-422c-a21f-33bdcf4493c1  rack1
UN  10.0.0.154  255.92 MB  256     100.0%            ce23f3a7-67d2-47c0-9ece-7a5dd67c4105  rack1
UN  10.0.0.76   461.26 MB  256     100.0%            c8e18603-0ede-43f0-b713-3ff47ad92323  rack1
UN  10.0.0.94   575.78 MB  256     100.0%            9a324dbc-5ae1-4788-80e4-d86dcaae5a4c  rack1
DN  10.0.0.47   ?          256     100.0%            7b628ca2-4e47-457a-ba42-5191f7e5374b  rack1

I tried to export some data using COPY TO, but it failed after a long series of retries.
Why does it fail?
How can I make a copy?
There should still be 4 copies of each row on the other (live) replicas.

cqlsh 10.0.0.154 -e "COPY X.Y TO 'backup/X.Y' WITH NUMPROCESSES=1"

Using 1 child processes

Starting copy of X.Y with columns [key, column1, value].
2018-06-29 19:12:23,661 Failed to create connection pool for new host 10.0.0.47:
Traceback (most recent call last):
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/cluster.py", line 2476, in run_add_or_renew_pool
    new_pool = HostConnection(host, distance, self)
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/pool.py", line 332, in __init__
    self._connection = session.cluster.connection_factory(host.address)
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/cluster.py", line 1205, in connection_factory
    return self.connection_class.factory(address, self.connect_timeout, *args, **kwargs)
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/connection.py", line 332, in factory
    conn = cls(host, *args, **kwargs)
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/io/asyncorereactor.py", line 344, in __init__
    self._connect_socket()
  File "/usr/lib/foobar/lib/python3.5/site-packages/cassandra/connection.py", line 371, in _connect_socket
    raise socket.error(sockerr.errno, "Tried connecting to %s. Last error: %s" % ([a[4] for a in addresses], sockerr.strerror or sockerr))
OSError: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:12:23,665 Host 10.0.0.47 has been marked down
2018-06-29 19:12:29,674 Error attempting to reconnect to 10.0.0.47, scheduling retry in 2.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:12:36,684 Error attempting to reconnect to 10.0.0.47, scheduling retry in 4.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:12:45,696 Error attempting to reconnect to 10.0.0.47, scheduling retry in 8.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:12:58,716 Error attempting to reconnect to 10.0.0.47, scheduling retry in 16.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:13:19,756 Error attempting to reconnect to 10.0.0.47, scheduling retry in 32.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:13:56,834 Error attempting to reconnect to 10.0.0.47, scheduling retry in 64.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:15:05,887 Error attempting to reconnect to 10.0.0.47, scheduling retry in 128.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:17:18,982 Error attempting to reconnect to 10.0.0.47, scheduling retry in 256.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
2018-06-29 19:21:40,064 Error attempting to reconnect to 10.0.0.47, scheduling retry in 512.0 seconds: [Errno None] Tried connecting to [('10.0.0.47', 9042)]. Last error: timed out
<stdin>:1:(4, 'Interrupted system call')
IOError:
IOError:
IOError:
IOError:
IOError:


-- 
Best Regards,
Dmitry Simonov

Re: cqlsh COPY ... TO ... doesn't work if one node down

Posted by "@Nandan@" <na...@gmail.com>.

The CQL COPY command will not work if you are trying to copy from all
nodes, because COPY checks that all N nodes are in UP and RUNNING status.
If you want the copy to complete, you have 2 options:
1) Remove the DOWN node from the COPY command
2) Bring it back to UP and NORMAL status.
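
Before choosing which host to point cqlsh at, it can help to list the nodes that are actually up. The following is a minimal sketch that filters `nodetool status` output for rows in the UN (Up/Normal) state, assuming the column layout shown in the original post. The function name `live_addresses` and the `SAMPLE_STATUS` constant are hypothetical, and the sketch only reports live addresses; it does not change how COPY selects hosts.

```python
# Sample input copied from the nodetool status output in this thread.
SAMPLE_STATUS = """\
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.82   253.64 MB  256     100.0%            839bef9d-79af-422c-a21f-33bdcf4493c1  rack1
UN  10.0.0.154  255.92 MB  256     100.0%            ce23f3a7-67d2-47c0-9ece-7a5dd67c4105  rack1
UN  10.0.0.76   461.26 MB  256     100.0%            c8e18603-0ede-43f0-b713-3ff47ad92323  rack1
UN  10.0.0.94   575.78 MB  256     100.0%            9a324dbc-5ae1-4788-80e4-d86dcaae5a4c  rack1
DN  10.0.0.47   ?          256     100.0%            7b628ca2-4e47-457a-ba42-5191f7e5374b  rack1
"""


def live_addresses(status_output):
    """Return the addresses of rows whose status/state code is UN."""
    live = []
    for line in status_output.splitlines():
        fields = line.split()
        # Node rows start with a two-letter code (UN, DN, ...) followed by the address.
        if len(fields) >= 2 and fields[0] == "UN":
            live.append(fields[1])
    return live


if __name__ == "__main__":
    for address in live_addresses(SAMPLE_STATUS):
        print(address)
```

With the sample above, the down node 10.0.0.47 is skipped and the four UN addresses are printed in the order they appear.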




Re: cqlsh COPY ... TO ... doesn't work if one node down

Posted by Anup Shirolkar <an...@instaclustr.com>.

Hi,

The error shows that the cqlsh connection to the down node failed.
So you should debug why that happened.

Although you specified another node ('10.0.0.154') in the cqlsh command,
my guess is that the down node was present in the connection pool, so a
connection to it was attempted.
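
The reconnection attempts in the log are spaced with delays that double each time (2, 4, 8, ... 512 seconds). A minimal sketch of such a doubling schedule follows. The function name `backoff_delays` and the parameters `base_delay` and `max_delay` are assumptions chosen to reproduce the delays seen in the log; they are not values read from cqlsh or the driver configuration.

```python
from itertools import islice


def backoff_delays(base_delay=2.0, max_delay=600.0):
    """Yield retry delays in seconds, doubling each attempt and capped at max_delay."""
    delay = base_delay
    while True:
        yield min(delay, max_delay)
        delay *= 2


if __name__ == "__main__":
    # First nine delays, matching the "scheduling retry in N seconds" lines in the log.
    schedule = list(islice(backoff_delays(), 9))
    print(schedule)  # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 512.0]
```

Under these assumed parameters the next delay would be capped at 600 seconds rather than doubling to 1024.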

Ideally, the availability of data should not be affected by the
unavailability of one replica out of 5.
Also, the stack trace is about a 'cqlsh' connection error.
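
The point about one replica out of 5 can be checked with simple arithmetic: a read at QUORUM needs floor(RF / 2) + 1 replicas, ONE needs a single replica, and ALL needs every replica. The sketch below applies that to RF=5 with 4 live replicas. The function names are hypothetical, and the sketch says nothing about which consistency level the COPY in this thread actually used.

```python
def required_replicas(level, replication_factor):
    """Number of replicas that must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %s" % level)


def is_satisfiable(level, replication_factor, live_replicas):
    """True if enough replicas are alive to satisfy the consistency level."""
    return live_replicas >= required_replicas(level, replication_factor)


if __name__ == "__main__":
    # RF=5 with one node down leaves 4 live replicas for every row.
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, is_satisfiable(level, replication_factor=5, live_replicas=4))
```

For RF=5 and 4 live replicas, ONE and QUORUM can be satisfied and only ALL cannot.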

I think once you get your connection sorted, the COPY should work as usual.

Regards,
Anup




