Posted to user@cassandra.apache.org by Marco Gasparini <ma...@competitoor.com.INVALID> on 2021/03/01 09:34:01 UTC

MISSING keyspace

hello everybody,

This morning (Monday!), I was checking on the Cassandra cluster and noticed
that all data was missing. I found the following error on each node (9
nodes in the cluster):

2021-03-01 09:05:52,984 WARN  [MessagingService-Incoming-/x.x.x.x]
IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading
from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created,
this is likely due to the schema not being fully propagated.  Please wait
for schema agreement on table creation.
        at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
        at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
        at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
        at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)

I tried to query the keyspace and got this:

node1# cqlsh
Connected to Cassandra Cluster at x.x.x.x:9042.
[cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> select * from mykeyspace.mytable  where id = 123935;
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Keyspace mykeyspace does not exist"

Investigating each node, I found that all the SSTables still exist, so I
think the data is still there but the keyspace vanished, "magically".

Other facts I can tell you are:

   - I have been getting anticompaction errors from 2 nodes because their
   disks were almost full.
   - The cluster was online on Friday.
   - This morning, Monday, the whole cluster was offline and I noticed the
   "missing keyspace" problem.
   - Over the weekend the cluster was subject to inserts and deletes.
   - It is a 9 node (HDD) Cassandra 3.11 cluster.

I really need help with this: how can I restore the cluster?

Thank you very much
Marco

Re: MISSING keyspace

Posted by Erick Ramirez <er...@datastax.com>.
As the warning message suggests, you need to check for schema disagreement.
My suspicion is that someone made a schema change and possibly dropped the
problematic keyspace.

FWIW I suspect the keyspace was dropped because the table isn't new -- the
CF ID cba90a70-5c46-11e9-9e36-f54fe3235e69 is a time-based UUID whose
embedded timestamp is equivalent to 11 Apr 2019.

Check for the existence of the keyspace via cqlsh on other nodes in the
cluster (not the node which ran out of disk space). Cheers!
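
For reference, the creation date embedded in a time-based cfId can be
double-checked with a few lines of Python (standard library only); a minimal
sketch using the cfId from the warning above:

    import uuid, datetime

    cf_id = uuid.UUID("cba90a70-5c46-11e9-9e36-f54fe3235e69")    # version-1 (time-based) UUID
    GREGORIAN_TO_UNIX_100NS = 0x01B21DD213814000                 # 100-ns ticks from 1582-10-15 to 1970-01-01
    unix_seconds = (cf_id.time - GREGORIAN_TO_UNIX_100NS) / 1e7  # uuid1 time is counted in 100-ns ticks
    print(datetime.datetime.utcfromtimestamp(unix_seconds))      # prints a date around 2019-04-11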

Re: MISSING keyspace

Posted by Marco Gasparini <ma...@competitoor.com.INVALID>.
hi @Erick,

Actually, the timestamp 1614575293790 is equivalent to
                    GMT: Monday, 1 March 2021 05:08:13.790
which stands for
                    GMT+1: Monday, 1 March 2021 06:08:13.790 (my local timezone).
This is consistent with the timestamps in the other logs in the cluster.

Thank you for pointing me in the right direction, I'll certainly
investigate this.
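
The value 1614575293790 is plain Unix epoch milliseconds (it comes from the
dropped-... snapshot directory name discussed elsewhere in this thread), so
it can be checked with a quick Python sketch:

    from datetime import datetime, timezone

    print(datetime.fromtimestamp(1614575293790 / 1000, tz=timezone.utc))
    # 2021-03-01 05:08:13.790000+00:00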




Re: MISSING keyspace

Posted by Erick Ramirez <er...@datastax.com>.
The timestamp (1614575293790) in the snapshot directory name is equivalent
to 1 March 16:08 GMT:

> actually I found a lot of .db files in the following directory:
>
> /var/lib/cassandra/data/mykespace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
>

which lines up nicely with this log entry:


> 2021-03-01 06:08:08,864 INFO  [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'
>

In any case, those 2 pieces of information are evidence that the keyspace
didn't get randomly dropped -- some operator/developer/daemon/orchestration
tool/whatever initiated it either intentionally or by accident.

I've seen this happen a number of times where a developer thought they were
connected to a dev/staging/test environment and issued a DROP or TRUNCATE,
not realising they were connected to production. I'm not saying this is what
happened in your case; I'm just giving you ideas for your investigation.
Cheers!

Re: MISSING keyspace

Posted by Marco Gasparini <ma...@competitoor.com.INVALID>.
I haven't made any schema modifications for a year or more.
This problem came up during a "normal day of work" for Cassandra.



Re: MISSING keyspace

Posted by Bowen Song <bo...@bso.ng.INVALID>.
Your missing keyspace problem has nothing to do with that bug.

In that case, the same table was created twice in a very short period of
time, and I suspect that was done concurrently on two different nodes.
The evidence lies in the two CF IDs - bd7200a0156711e88974855d74ee356f
and bd750de0156711e8bdc54f7bcdcb851f - which were created at
2018-02-19T11:26:33.898 and 2018-02-19T11:26:33.918 respectively, a gap
of merely 20 milliseconds.

TBH, it doesn't sound like a bug to me. Cassandra is eventually
consistent by design, and two conflicting schema changes on two
different nodes at nearly the same time will likely result in schema
disagreement. Cassandra will eventually reach agreement again, possibly
discarding one of the conflicting schema changes together with all data
written to the discarded table/columns. To make sure this doesn't happen
to your data, you should avoid making multiple schema changes to the
same keyspace (for create/alter/... keyspace) or the same table (for
create/alter/... table) on two or more Cassandra coordinator nodes in a
very short period of time. Instead, send all your schema change queries
to the same coordinator node, or if that's not possible, wait at least
30 seconds between two schema changes and make sure you aren't
restarting any node at the same time.
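
As a side note, one lightweight way to confirm schema agreement from a
client before issuing the next DDL statement is to compare schema_version
across system.local and system.peers. A rough sketch with the DataStax
Python driver (the contact point below is a placeholder):

    from cassandra.cluster import Cluster

    cluster = Cluster(["x.x.x.x"])          # placeholder contact point
    session = cluster.connect()

    local = session.execute("SELECT schema_version FROM system.local").one()
    peers = session.execute("SELECT peer, schema_version FROM system.peers")
    versions = {local.schema_version} | {row.schema_version for row in peers}

    if len(versions) == 1:
        print("schema in agreement:", versions.pop())
    else:
        print("schema disagreement, hold off on further DDL:", versions)

"nodetool describecluster" reports the same schema versions from the
server side.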



Re: MISSING keyspace

Posted by Marco Gasparini <ma...@competitoor.com.INVALID>.
actually I found a lot of .db files in the following directory:

/var/lib/cassandra/data/mykespace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable

I also found this:
             2021-03-01 06:08:08,864 INFO  [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'

so I think that you, @erick and @bowen, are right. Something dropped the
keyspace.

I will try to follow your procedure @bowen, thank you very much!

Do you know what could have caused this issue?
It seems like a big issue. I found this bug:
https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel
Maybe they are related...

Thank you @Bowen and @Erick






Re: MISSING keyspace

Posted by Bowen Song <bo...@bso.ng.INVALID>.
The warning message indicates the node y.y.y.y went down (or became
unreachable over the network) before 2021-02-28 05:17:33. Is there any
chance you can find the log file on that node from around or before that
time? It may show why that node went down. The reason might be irrelevant
to the missing keyspace, but it is still worth a look in order to prevent
the same thing from happening again.

As Erick said, the table's CF ID isn't new, so it's unlikely to be a
schema synchronization issue. Therefore I also suspect the keyspace was
accidentally dropped. Cassandra only logs "Drop Keyspace
'keyspace_name'" on the node that received the "DROP KEYSPACE ..."
query, so you may have to search the log files from all nodes to
find it.
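
A simple way to run that search across all nodes, assuming SSH access to
each of them (hostnames and log path below are placeholders; adjust to
your installation):

    import subprocess

    NODES = ["node1", "node2", "node3"]   # placeholder hostnames, one entry per Cassandra node
    CMD = "grep -H 'Drop Keyspace' /var/log/cassandra/system.log"

    for node in NODES:
        out = subprocess.run(["ssh", node, CMD], capture_output=True, text=True)
        if out.stdout:
            print("--- " + node + " ---")
            print(out.stdout, end="")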

Assuming the keyspace was dropped but you still have the SSTable files,
you can recover the data by re-creating the keyspace and tables with an
identical replication strategy and schema, then copying the SSTable
files to the corresponding new table directories (which have different
CF ID suffixes) on the same node, and finally running "nodetool refresh
..." or restarting the node. Since you don't yet have a full backup, I
strongly recommend you make a backup, and ideally test restoring it to a
different cluster, before attempting this.
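
If it comes to that, the per-node restore step might look roughly like the
sketch below (paths and names are placeholders; the keyspace and table must
already have been re-created with the original schema before copying
anything):

    import glob, shutil, subprocess

    # Placeholder paths: the snapshot kept by the DROP, and the directory of the re-created table
    snapshot_dir = "/var/lib/cassandra/data/mykeyspace/mytable-<old_cfid>/snapshots/dropped-<ts>-mytable"
    new_table_dir = "/var/lib/cassandra/data/mykeyspace/mytable-<new_cfid>"

    for f in glob.glob(snapshot_dir + "/*"):
        shutil.copy2(f, new_table_dir)    # copy rather than move, so the snapshot stays intact

    # Make the node load the newly placed SSTables without a restart
    subprocess.run(["nodetool", "refresh", "mykeyspace", "mytable"], check=True)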



Re: MISSING keyspace

Posted by Marco Gasparini <ma...@competitoor.com.INVALID>.
here the previous error:

2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
        at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
        at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Yes, this node (y.y.y.y) stopped because it ran out of disk space.


I said "deleted" because I'm not a native english speaker :)
I usually "remove" snapshots via 'nodetool clearsnapshot' or
cassandra-reaper user interface.





Re: MISSING keyspace

Posted by Bowen Song <bo...@bso.ng.INVALID>.
What was the warning? Is it related to the disk failure policy? Could
you please share the relevant log? You can edit it and redact the
sensitive information before sharing it.

Also, I can't help noticing that you used the word "delete" (instead of
"clear") to describe the process of removing snapshots. May I ask how
you deleted the snapshots? Was it "nodetool clearsnapshot ...",
"rm -rf ..." or something else?



Re: MISSING keyspace

Posted by Marco Gasparini <ma...@competitoor.com.INVALID>.
Thanks Bowen for answering.

Actually, I checked the server log and the only warning was that a node
went offline.
No, I have no backups or snapshots.

In the meantime I found that Cassandra probably moved all the files from a
directory to the snapshot directory. I am pretty sure of that because I
recently deleted all the snapshots I had made (the disk was running out of
space), and this very directory is full of files whose modification
timestamp is the same as that of the first error I got in the log.




Re: MISSING keyspace

Posted by Bowen Song <bo...@bso.ng.INVALID>.
The first thing I'd check is the server log. The log may contain vital
information about the cause, and there may be different ways to recover
depending on that cause.

Also, please allow me to ask a seemingly obvious question: do you have a
backup?

