Posted to user@cassandra.apache.org by Robert Wille <rw...@fold3.com> on 2014/12/09 18:41:11 UTC

Observations/concerns with repair and hinted handoff

I have spent a lot of time working with single-node, RF=1 clusters during development. Before deploying a cluster to our live environment, I've been learning how to work with a multi-node cluster with RF=3, and there were some surprises. I'm hoping people here can enlighten me; I don't exactly have that warm, fuzzy feeling.

I created a three-node cluster with RF=3 and then wrote to it heavily enough to cause some dropped mutation messages. The dropped messages didn't trickle in; they came in a burst. I suspect a full GC is the culprit, but I don't really know. In any case, I ended up with 17197 dropped mutation messages on node 1, 6422 on node 2, and none on node 3. To learn about repair, I waited for compaction to finish, recorded the size and estimated number of keys for each table, started repair (nodetool repair <keyspace>) on all three nodes, and waited for it to complete before doing anything else (even reads). When repair and the ensuing compaction were done, I checked the size and estimated number of keys for each table again. All tables on all nodes grew in both size and estimated number of keys. The estimated number of keys grew by 65k, 272k and 247k (0.2%, 0.7% and 0.6%) for nodes 1, 2 and 3 respectively. I expected some growth, but that's significantly more new keys than I had dropped mutation messages. I also expected the most new data on node 1 and none on node 3, which is not at all what actually happened. Perhaps a mutation message contains more than one record? Perhaps the dropped mutation message counter is incremented on the coordinator rather than on the node that was overloaded?
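
In case it helps interpret the numbers above, here is roughly how I gathered them (my_keyspace is a stand-in for my actual keyspace):

    # dropped message counts (the MUTATION line near the bottom of the output)
    nodetool tpstats

    # per-table "Space used" and "Number of keys (estimate)", before and after
    nodetool cfstats my_keyspace

    # the repair I ran on each of the three nodes
    nodetool repair my_keyspace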

I repeated repair, and the second time around the tables remained unchanged, as expected. I would hope that repair wouldn’t do anything to the tables if they were in sync. 

Just to be clear, I'm not overly concerned about the unexpected increase in the number of keys. I'm pretty sure repair did what it needed to and brought the nodes into sync. The unexpected results more likely indicate my own ignorance, and it really bothers me when I don't understand something. If you have any insights, I'd appreciate them.

One of the dismaying things about repair was that the first time around it took about 4 hours, on a completely idle cluster (except for the repairs, of course) with only 6 GB of data per node. I can bootstrap a node with 6 GB of data in a couple of minutes, which makes repair something like 50 to 100 times more expensive than bootstrapping. I know I should run repair on one node at a time, but even if you divide by three, that's still a horrifically long time for such a small amount of data. The second time around, repair took only 30 minutes. That's much better, but even the best case is still about 10x slower than bootstrapping. Should repair really take this long? When I have 300 GB of data, is a best-case repair going to take 25 hours, and a repair with a modest amount of work to do more than 100 hours? My records are quite small; those 6 GB contain almost 40 million partitions.
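
For what it's worth, by "run repair on one node at a time" I mean something like this, run serially on each node in turn (-pr so each node repairs only its primary ranges; my_keyspace is again a placeholder):

    # node 1 first, wait for it to complete, then node 2, then node 3
    nodetool repair -pr my_keyspace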

Following my repair experiment, I added a fourth node, then killed one node and imported a bunch of data while it was down. As far as repair is concerned, this seems to work fine (although, again, glacially). However, I noticed that hinted handoff doesn't seem to be working. I added several million records (with consistency=one), yet essentially nothing appeared in system.hints (du -hs showed only a few dozen KB), nor did I see any pending Hinted Handoff tasks in the Thread Pool Stats. When I started the down node back up (less than 3 hours later), the missed data didn't appear to be sent to it: the tables did not grow, no compactions were triggered, and there was no appreciable CPU utilization anywhere in the cluster. With millions of records missed while it was down, I should have noticed something if the hints actually were being replayed. Is there some magic setting to turn on hinted handoff? Were there too many hints, so it just deleted them? My assumption is that if hinted handoff is working, my need for repair should be much smaller, which, given my experience so far, would be a really good thing.
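
For reference, these are the checks I described above, roughly as I ran them (the data path is from a default install and is only illustrative):

    # on-disk size of the hints table on each surviving node
    du -hs /var/lib/cassandra/data/system/hints*

    # anything actually stored in the hints table?
    cqlsh> SELECT count(*) FROM system.hints;

    # any HintedHandoff activity in the thread pool stats?
    nodetool tpstats | grep -i hint

    # cassandra.yaml settings I'd guess are relevant (defaults shown):
    # hinted_handoff_enabled: true
    # max_hint_window_in_ms: 10800000   (3 hours)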

Given how horrifically long it takes to repair a node, and that hinted handoff apparently isn't working, if a node goes down, is it better to bootstrap a replacement than to repair the node that went down? I would expect that even if I chose to bootstrap a new node, it would need to be repaired anyway, since it would probably miss writes while bootstrapping.
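
If the answer is to bring up a replacement, my (possibly wrong) understanding is that the way to do it is to start a fresh node with the replace_address flag rather than doing a plain bootstrap plus repair, something like this on the replacement node (10.0.0.4 standing in for the dead node's address):

    # added to cassandra-env.sh (or the command line) for the first startup only
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.4"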

Thanks in advance

Robert