Posted to user@cassandra.apache.org by Oleg Proudnikov <ol...@cloudorange.com> on 2011/01/20 21:15:52 UTC

UnserializableColumnFamilyException: Couldn't find cfId

Hi All,

Could you please help me understand the impact on my data?

I am running a 6-node Cassandra 0.7-rc4 cluster with RF=2. The schema was
defined when the cluster was created and has not changed since. I am doing a
batch load with CL=ONE. The cluster is under some memory and I/O stress. Each
node has a 1G heap. CPU is around 10% but latency is high.

I saw this exception on 2 out of 6 nodes in a relatively short window of time. 
Hector clients received no exception and the nodes continued running. The
exception has not happened since even though the load is continuing. 
I do get an occasional OOM and I am adjusting thresholds and other 
settings as I go. I also doubled RAM to 2G since the exception.

Here is the exception - the same stack trace in all cases.
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
 at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:117)
 at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:385)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:395)
 at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:353)
 at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
 at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)


The exceptions refer to two cfIds: cfId=1004 and cfId=1013. The mutation stage
thread is always different, even for exceptions logged within the same
millisecond. As you can see below, cfId=1004 appears on both nodes several
times but at different times, while cfId=1013 appears only once, on one node.

The exceptions came as one group within a single second on one node and as 5
groups spread across 45 minutes on the other node. I kept only the first log
entry of each group.

xxx.xxx.xxx.140 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.141 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.142 grep -i cfid -B 1 log/cassandra.log
xxx.xxx.xxx.143 grep -i cfid -B 1 log/cassandra.log


xxx.xxx.xxx.144 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004


xxx.xxx.xxx.145 grep -i cfid -B 1 log/cassandra.log
ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:13] 2011-01-14 15:03:28,637 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:27] 2011-01-14 15:05:02,513 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:4] 2011-01-14 15:12:30,731 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004
--
ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java (line 83) Error in row mutation
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1013
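
For anyone who wants to tabulate this kind of grep output rather than read it
by eye, here is a minimal Python sketch. It assumes each error is logged as an
"ERROR [stage] timestamp ..." header line followed by (or containing) a
"cfId=N" line, as in the excerpts above. The function names and the sample
list are illustrative only and are not part of Cassandra.

```python
import re
from collections import Counter
from datetime import datetime

# Header line as it appears in the excerpts above, e.g.
# "ERROR [MutationStage:11] 2011-01-14 15:02:03,911 RowMutationVerbHandler.java ..."
HEADER = re.compile(
    r"^ERROR \[(?P<stage>[^\]]+)\] "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
)
# Allow a leading minus sign, since corrupted ids may be negative.
CFID = re.compile(r"cfId=(?P<cfid>-?\d+)")


def parse_cfid_errors(lines):
    """Return a list of (timestamp, stage, cfid) for each cfId error."""
    events = []
    current = None  # (timestamp, stage) of the most recent ERROR header
    for line in lines:
        header = HEADER.match(line)
        if header:
            ts = datetime.strptime(header.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            current = (ts, header.group("stage"))
        found = CFID.search(line)
        if found and current is not None:
            events.append((current[0], current[1], int(found.group("cfid"))))
            current = None  # do not reuse a header for a later line
    return events


def count_by_cfid(events):
    """Count how many errors were seen per cfId."""
    return dict(Counter(cfid for _, _, cfid in events))


# Two entries copied from the log excerpts above.
sample = [
    "ERROR [MutationStage:1] 2011-01-14 15:02:34,460 RowMutationVerbHandler.java (line 83) Error in row mutation",
    "org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1004",
    "--",
    "ERROR [MutationStage:23] 2011-01-14 15:47:03,416 RowMutationVerbHandler.java (line 83) Error in row mutation",
    "org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1013",
]
events = parse_cfid_errors(sample)
for ts, stage, cfid in events:
    print(ts, stage, cfid)
print(count_by_cfid(events))
```

Running the same parser over the full log from each node would show whether
the errors cluster in time per cfId, which is what the grouping described
above suggests.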



Q. What does this mean for the consistency? Am I still within my guarantee of
CL=ONE? 



NOTE: I experienced similar exceptions in 0.7-rc2 but at that time cfIds looked
corrupted. They were random/negative and these exceptions 
were followed by an OOM with an attempt to allocate a huge HeapByteBuffer.

Thank you very much,
Oleg




Re: UnserializableColumnFamilyException: Couldn't find cfId

Posted by Ching-Cheng Chen <cc...@evidentsoftware.com>.
We have seen a similar exception before, and the root cause was as Aaron
described.

You will encounter this exception if your code creates a CF on the fly and
data is inserted into a node whose schema has not synced yet.

You have to call describe_schema_versions() and make sure all nodes report the
same schema version before you start inserting data into the newly created CF.
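
A minimal Python sketch of that check follows. It assumes the client exposes
the result of describe_schema_versions() as a map from schema version to a
list of node addresses, and that unreachable nodes are listed under an
"UNREACHABLE" key. Both the helper names and the fetch_versions callable are
hypothetical; plug in whatever call your client library provides.

```python
import time

# Assumption: unreachable nodes are reported under this key in the map
# returned by describe_schema_versions().
UNREACHABLE = "UNREACHABLE"


def schema_agreed(versions):
    """True when every reachable node reports the same schema version.

    `versions` maps a schema version string to a list of node addresses.
    """
    live = {
        version: hosts
        for version, hosts in versions.items()
        if version != UNREACHABLE and hosts
    }
    return len(live) == 1


def wait_for_schema_agreement(fetch_versions, timeout=30.0, interval=1.0,
                              sleep=time.sleep):
    """Poll fetch_versions() until the schema agrees or the timeout elapses.

    `fetch_versions` is a caller-supplied callable (hypothetical here) that
    returns the describe_schema_versions() map from your client.
    """
    deadline = time.monotonic() + timeout
    while True:
        if schema_agreed(fetch_versions()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The idea is to create the CF, call wait_for_schema_agreement(), and only start
inserting once it returns True. Note that this sketch ignores unreachable
nodes; a stricter version could treat any unreachable node as disagreement.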

Regards,

Chen

On Thu, Jan 20, 2011 at 5:34 PM, Aaron Morton <aa...@thelastpickle.com> wrote:

> Sounds like there are multiple versions of your schema around the cluster.
> What client API are you using? Does it support
> the describe_schema_versions() function? This will tell you how many
> versions there are.
>
> The easy solution here is to scrub the data and start a new 0.7 cluster using
> the release version. If possible, you should not use data created in
> non-release versions once you get to production.
>
> Hope that helps.
> Aaron
>

Re: UnserializableColumnFamilyException: Couldn't find cfId

Posted by Aaron Morton <aa...@thelastpickle.com>.
Sounds like there are multiple versions of your schema around the cluster. What client API are you using? Does it support the describe_schema_versions() function? This will tell you how many versions there are. 

The easy solution here is to scrub the data and start a new 0.7 cluster using the release version. If possible, you should not use data created in non-release versions once you get to production.

Hope that helps. 
Aaron

