Posted to user@cassandra.apache.org by Anthony Molinaro <an...@alumni.caltech.edu> on 2010/04/23 19:30:09 UTC

Odd ring problems with 0.5.1

So I've been trying to migrate off of old EC2 m1.large nodes onto xlarge
nodes so I can get enough breathing room to then do an upgrade to 0.6.x
(I can't keep the large nodes up long enough, so I spend all my time
restarting and trying to move data, and can't get all the packages I would
need for 0.6.x updated).

Anyway, I've been bootstrapping in new nodes in between old nodes then
running decommission.  Sometimes it seems to work, but I've been noticing
some oddness.

Some nodes appear in the ring as seen from some nodes, but not from
others.  Right now I have 14 nodes; 10 of them give the same nodeprobe
ring output, and the other 4 are missing one node.  Also, I have a
couple of nodes that, when I try to bootstrap them with an InitialToken,
get put into yet another ring with only a few nodes, including nodes
that I called removetoken on.  They all have the same seed node and it
has not gone down.  The seed node sees all nodes.
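
One way to pin down which nodes disagree is to group nodes by the set of
endpoints each one reports.  A minimal Python sketch, assuming the addresses
have already been pulled out of each node's nodeprobe ring output into sets
(the parsing step is omitted, and the function names are hypothetical):

```python
from collections import defaultdict

def group_by_ring_view(views):
    """Group nodes by the set of endpoints each one reports.

    views: dict mapping node address -> iterable of endpoint addresses
    that node lists in its ring. Returns (endpoints, nodes) pairs,
    largest group first.
    """
    groups = defaultdict(list)
    for node, endpoints in views.items():
        groups[frozenset(endpoints)].append(node)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

def missing_from_minority(views):
    """For each node outside the largest group, list the endpoints it
    lacks relative to the majority view."""
    grouped = group_by_ring_view(views)
    majority = grouped[0][0]
    return {node: sorted(majority - seen)
            for seen, nodes in grouped[1:] for node in nodes}
```

Treating the largest group as the reference view is only a heuristic; a
majority of nodes can still be wrong about the ring.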

Anyone seen this?  How can I get those 4 nodes to see the missing node?
If this is a known issue, has it been fixed in 0.6 or newer?

Thanks,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Odd ring problems with 0.5.1

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Turns out I needed to shut everything down completely, then start it all
up; a rolling restart was still resulting in some nodes being confused about
what ring they were in.

I think the moral of all this is that any change to the seed node must be
followed by a full restart of your cluster.  Also, any use of removetoken
is perilous.

The good news is I'm off of the old nodes.  I'll need to figure out a way
to bulk load the data from some of the old sstables, but I think
sstable2json and a quick perl script to load might work out.
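
A rough sketch of what such a loader could look like, written in Python
rather than perl.  It assumes the sstable2json dump is a JSON object mapping
each row key to a list of [name, value, timestamp, ...] entries for a
standard column family; that layout is an assumption and varies by Cassandra
version, so check it against a real dump first.  The insert callback is a
hypothetical stand-in for whatever thrift client call does the write:

```python
import json

def iter_columns(dump_text):
    """Yield (row_key, name, value, timestamp) for every column in the dump.

    Assumed layout: {"key": [[name, value, timestamp, ...], ...], ...}.
    Names and values are passed through untouched (no decoding).
    """
    data = json.loads(dump_text)
    for key, columns in data.items():
        for col in columns:
            yield key, col[0], col[1], col[2]

def load(dump_text, insert):
    """Call insert(key, name, value, timestamp) once per column.

    insert is a caller-supplied function (hypothetical here). Returns
    the number of columns handed to it.
    """
    count = 0
    for key, name, value, timestamp in iter_columns(dump_text):
        insert(key, name, value, timestamp)
        count += 1
    return count
```

Passing the original timestamps through means a reloaded column should not
overwrite a newer write that already exists in the cluster.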

Then after that, upgrade to 0.6.x.

-Anthony

On Fri, Apr 23, 2010 at 02:22:11PM -0700, Anthony Molinaro wrote:
> 
> On Fri, Apr 23, 2010 at 01:17:21PM -0500, Jonathan Ellis wrote:
> > On Fri, Apr 23, 2010 at 1:12 PM, Anthony Molinaro
> > <an...@alumni.caltech.edu> wrote:
> > > I'm not sure how it would get this, maybe I need to restart my seed node?
> > 
> > It's worth a try.  Sounds like you found an unusual bug in gossip.
> 
> Damn, restarting the seed resulted in the seed coming up in a new ring
> with 3 nodes which have been decommissioned.  Seems like restarting other
> nodes brings them into that ring (or at least the first few seem to be in
> the new ring).  I'll restart them all to see if I can get to a consistent
> ring.  You know what might have happened: I changed the IP of the seed host
> in my /etc/hosts before starting to decommission, and I bet I should have
> then restarted everything.  Oh well, hopefully most of my data is still
> viable.
> 
> I do still have all the old sstables lying around, can I just sstable2json
> then json2sstable and have it reload them?  Or do the sstables need to be
> keyed to the keyrange?  I guess I can sstable2json then create an import
> script to insert them via thrift?
> 
> > > When I run nodeprobe ring on the seed I don't see any of the hosts I
> > > decommissioned, but maybe they are still listed there somewhere?
> > 
> > 0.5 does leave decommissioned host information in gossip, but I'm not
> > sure how that applies to this problem.
> 
> I bet that was a red herring.  I'm pretty convinced now this was all a
> result of me not restarting all the nodes after making a change to the
> seed.
> 
> -Anthony

Re: Odd ring problems with 0.5.1

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

On Fri, Apr 23, 2010 at 01:17:21PM -0500, Jonathan Ellis wrote:
> On Fri, Apr 23, 2010 at 1:12 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> > I'm not sure how it would get this, maybe I need to restart my seed node?
> 
> It's worth a try.  Sounds like you found an unusual bug in gossip.

Damn, restarting the seed resulted in the seed coming up in a new ring
with 3 nodes which have been decommissioned.  Seems like restarting other
nodes brings them into that ring (or at least the first few seem to be in
the new ring).  I'll restart them all to see if I can get to a consistent
ring.  You know what might have happened: I changed the IP of the seed host
in my /etc/hosts before starting to decommission, and I bet I should have
then restarted everything.  Oh well, hopefully most of my data is still
viable.
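
A quick way to test that theory would be to compare what the seed hostname
resolves to on each node.  A minimal Python sketch, assuming resolve_seed is
run locally on every node and the results are gathered by hand into a dict
(both function names are hypothetical):

```python
import socket
from collections import defaultdict

def resolve_seed(seed_host):
    """Return the IPv4 address this machine resolves seed_host to.

    The lookup goes through the system resolver, so an /etc/hosts
    entry for the seed is normally reflected here.
    """
    return socket.gethostbyname(seed_host)

def split_by_seed_address(resolved):
    """Group nodes by the seed address each one resolved.

    resolved: dict mapping node name -> seed IP as seen from that node.
    More than one key in the result means the nodes disagree about
    where the seed lives.
    """
    groups = defaultdict(list)
    for node, seed_ip in resolved.items():
        groups[seed_ip].append(node)
    return dict(groups)
```

This only shows what a fresh lookup returns; a process that resolved the
seed before the change may still be holding the old address until restarted.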

I do still have all the old sstables lying around.  Can I just run
sstable2json then json2sstable and have it reload them?  Or do the sstables
need to be keyed to the keyrange?  I guess I can sstable2json then create an
import script to insert them via thrift?

> > When I run nodeprobe ring on the seed I don't see any of the hosts I
> > decommissioned, but maybe they are still listed there somewhere?
> 
> 0.5 does leave decommissioned host information in gossip, but I'm not
> sure how that applies to this problem.

I bet that was a red herring.  I'm pretty convinced now this was all a
result of me not restarting all the nodes after making a change to the
seed.

-Anthony

Re: Odd ring problems with 0.5.1

Posted by Jonathan Ellis <jb...@gmail.com>.

On Fri, Apr 23, 2010 at 1:12 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> I'm not sure how it would get this, maybe I need to restart my seed node?

It's worth a try.  Sounds like you found an unusual bug in gossip.

> When I run nodeprobe ring on the seed I don't see any of the hosts I
> decommissioned, but maybe they are still listed there somewhere?

0.5 does leave decommissioned host information in gossip, but I'm not
sure how that applies to this problem.

Re: Odd ring problems with 0.5.1

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

On Fri, Apr 23, 2010 at 12:41:17PM -0500, Jonathan Ellis wrote:
> On Fri, Apr 23, 2010 at 12:30 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> > Some nodes appear in the ring from some nodes, but not others.  Right
> > now I have 14 nodes, 10 of those nodes have the same output of a
> > nodeprobe ring, the other 4 are missing one node.
> 
> What's the history of the missing node?  Is it a newly bootstrapped one?

Yes, newly bootstrapped with an initial token.

> > Also, I have a
> > couple nodes that when I try to bootstrap them with an InitialToken
> > they get put into yet another ring with only a few nodes including
> > nodes that I called removetoken on.  They all have the same seed node
> > and it has not gone down.  The seed node has all nodes.
> >
> > Anyone seen this?
> 
> The only time I have seen multiple rings is when some nodes have been
> configured with a different seed than others.

Yeah, that's the first thing I checked, but the seed is the same.  The
other odd thing is that when I bootstrap a new node, it still finds hosts
which are no longer part of the cluster and puts them in the cluster.

I'm not sure how it would get this, maybe I need to restart my seed node?
When I run nodeprobe ring on the seed I don't see any of the hosts I
decommissioned, but maybe they are still listed there somewhere?

-Anthony

Re: Odd ring problems with 0.5.1

Posted by Jonathan Ellis <jb...@gmail.com>.

On Fri, Apr 23, 2010 at 12:30 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> Some nodes appear in the ring from some nodes, but not others.  Right
> now I have 14 nodes, 10 of those nodes have the same output of a
> nodeprobe ring, the other 4 are missing one node.

What's the history of the missing node?  Is it a newly bootstrapped one?

> Also, I have a
> couple nodes that when I try to bootstrap them with an InitialToken
> they get put into yet another ring with only a few nodes including
> nodes that I called removetoken on.  They all have the same seed node
> and it has not gone down.  The seed node has all nodes.
>
> Anyone seen this?

The only time I have seen multiple rings is when some nodes have been
configured with a different seed than others.

-Jonathan