
Posted to user@cassandra.apache.org by Anthony Molinaro <an...@alumni.caltech.edu> on 2010/07/15 00:58:47 UTC

Bootstrap question

Hi,

  I have a 0.6.3 cluster which contains 6 nodes.  I added 6 new nodes
by setting AutoBootstrap to true and setting an InitialToken on each new
node, then waiting for the "Bootstrapping" message in the log before
starting another.  Then I've been watching the logs on the old boxes
waiting to see AntiCompaction messages.

Unfortunately, after several hours only 1 of the 6 old nodes shows
the AntiCompaction message.  The new nodes are placed such that every
old node should have some data pulled from it.  Why don't I see more
AntiCompaction messages?  Are there other things I should be looking
at?

Thanks,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Oh, and looking at the load on the new machines it appears that
New 2 and New 6 have gotten some data (although neither is in the ring
yet).  Not sure if that clears anything up though.

-Anthony

On Thu, Jul 15, 2010 at 01:28:06PM -0700, Anthony Molinaro wrote:
> This is a cluster which is horribly imbalanced because I didn't assign
> initial tokens, so I'm adding 6 nodes with tokens according to the operations
> page (ie, i * (2^127/N) with N = 6).
> 
> So here's what the ring will look like when bootstrap finishes
> 
>                      151901684708361811491018697633480111658
>    Old 1  673.76 GB    1620761242680682425026573496599110901
>    Old 2  204.90 GB   10637639655367601517656788464652024082
>    Old 3  139.82 GB   21604748163853165203168832909938143241
>    New 1              28356863910078205288614550619314017621
>    Old 4  250.61 GB   46182405069378676149148922496055212595
>    New 2              56713727820156410577229101238628035242
>    New 3              85070591730234615865843651857942052863
>    Old 5  572.91 GB  103509928471922053310251250943275708086
>    New 4             113427455640312821154458202477256070485
>    New 5             141784319550391026443072753096570088106
>    Old 6  739.61 GB  151901684708361811491018697633480111658
>    New 6             170141183460469231731687303715884105728
> 
> So from this it seems like I should see anti-compaction on old nodes
> 4, 5, 6 and 1.
> 
> Looking now, it seems that 1 and 6 have had some anti-compaction
> happen, node 4 has
> 
>  INFO [STREAM-STAGE:1] 2010-07-14 20:53:26,579 StreamOut.java (line 95)
>   Performing anticompaction ...
> 
> in the log but not a corresponding
> 
> CompactionManager.java (line 339) AntiCompacting [..]
> 
> line
> 
> Node 5 has nothing in its logs about anti-compaction.
> 
> Is the fact that 2 new nodes are in the range messing it up?  And if so
> how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, then bringing
> up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> 
> -Anthony
> 
> 
> On Wed, Jul 14, 2010 at 08:45:45PM -0500, Jonathan Ellis wrote:
> > Each node logs what token it is going to bootstrap to.  Who owns the
> > ranges that contain those tokens?
> > 
> > -- 
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>
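
For reference, the token arithmetic quoted above (i * (2^127/N) with N = 6)
and the question of which old node owns each new token can be checked with a
short script. This is a minimal sketch, not anything taken from Cassandra
itself: the function names are hypothetical, the old-node tokens are copied
from the ring listing, and it assumes a range is owned by the first node whose
token is greater than or equal to the key, wrapping around the ring. Exact
integer division can differ in the last digit from tokens computed with
limited-precision division.

```python
from bisect import bisect_left

# Size of the token space implied by the formula in the message (2^127).
RING_SIZE = 2 ** 127


def balanced_tokens(n):
    """Return the tokens i * 2^127 / n for i = 1..n, using integer arithmetic."""
    return [(i * RING_SIZE) // n for i in range(1, n + 1)]


# Tokens of the six old nodes, copied from the ring listing in the message.
OLD_NODES = {
    "Old 1": 1620761242680682425026573496599110901,
    "Old 2": 10637639655367601517656788464652024082,
    "Old 3": 21604748163853165203168832909938143241,
    "Old 4": 46182405069378676149148922496055212595,
    "Old 5": 103509928471922053310251250943275708086,
    "Old 6": 151901684708361811491018697633480111658,
}


def current_owner(token, nodes):
    """Return the node whose range contains `token`.

    Assumption: the owner is the first node with a token >= `token`;
    a token past the last node wraps around to the first node.
    """
    ordered = sorted(nodes.items(), key=lambda item: item[1])
    tokens = [tok for _, tok in ordered]
    index = bisect_left(tokens, token) % len(tokens)
    return ordered[index][0]


if __name__ == "__main__":
    for i, tok in enumerate(balanced_tokens(6), start=1):
        print(f"New {i}  {tok}  falls in the range of {current_owner(tok, OLD_NODES)}")
```

Under those assumptions the six new tokens fall in ranges owned by Old 4,
Old 5, Old 6 and Old 1, which matches the old nodes the message expects to
anticompact.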

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> Is there something else I should try?  The only thing I can think of
> is deleting the system directory on the new node, and restarting, so
> I'll try that and see if it does anything.

So I tried this, and it didn't do anything.  No data is being transferred
to the new node.  Any ideas?  Is there no way to get a new node into this
cluster at this point?

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Gary Dusbabek <gd...@gmail.com>.

On Wed, Jul 21, 2010 at 14:14, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> Sure, looks like that's in 0.6.4, so I'll probably just rebuild my server
> based on the 0.6 branch, unless you want me to test just the patch for
> 1221?  Most likely won't get a chance to try until tomorrow, so let me
> know.
>

Either way works for me.

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Sure, looks like that's in 0.6.4, so I'll probably just rebuild my server
based on the 0.6 branch, unless you want me to test just the patch for
1221?  Most likely won't get a chance to try until tomorrow, so let me
know.

Thanks,

-Anthony

On Wed, Jul 21, 2010 at 06:58:13AM -0500, Gary Dusbabek wrote:
> Anthony,
> 
> I think you're seeing the results of CASSANDRA-1221.  Each node has
> two connections with its peers.  One connection is used for gossip,
> the other for exchanging commands.  What you see with 1221 is the
> command socket getting 'stuck' after a peer is convicted by gossip and
> then recovers.  It doesn't happen every time, but it happens much of
> the time, especially with streaming.  I was able to reproduce this at
> will using loadbalance, but never tried it under bootstrap (where the
> bootstrapping IP was previously visible on the cluster), but it seems
> very plausible.
> 
> Any chance you could apply the patch for 1221 and test?
> 
> Gary.

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Gary Dusbabek <gd...@gmail.com>.

Anthony,

I think you're seeing the results of CASSANDRA-1221.  Each node has
two connections with its peers.  One connection is used for gossip,
the other for exchanging commands.  What you see with 1221 is the
command socket getting 'stuck' after a peer is convicted by gossip and
then recovers.  It doesn't happen every time, but it happens much of
the time, especially with streaming.  I was able to reproduce this at
will using loadbalance, but never tried it under bootstrap (where the
bootstrapping IP was previously visible on the cluster), but it seems
very plausible.

Any chance you could apply the patch for 1221 and test?

Gary.

On Tue, Jul 20, 2010 at 16:45, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> I see this in the old nodes
>
> DEBUG [WRITE-/10.220.198.15] 2010-07-20 21:15:50,366 OutboundTcpConnection.java (line 142) attempting to connect to /10.220.198.15
> INFO [GMFD:1] 2010-07-20 21:15:50,391 Gossiper.java (line 586) Node /10.220.198.15 is now part of the cluster
> INFO [GMFD:1] 2010-07-20 21:15:51,369 Gossiper.java (line 578) InetAddress /10.220.198.15 is now UP
> INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,369 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /10.220.198.15
> INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,371 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /10.220.198.15
> DEBUG [GMFD:1] 2010-07-20 21:17:20,551 StorageService.java (line 512) Node /10.220.198.15 state bootstrapping, token 28356863910078205288614550619314017621
> DEBUG [GMFD:1] 2010-07-20 21:17:20,656 StorageService.java (line 746) Pending ranges:
> /10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]
> /10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]
>
> 10.220.198.15 is the new node
>
> The key ranges seem to be for the primary and replica ranges.
>
> So after that, I would expect some AntiCompaction to happen on some of the
> other nodes, but I don't see anything.
>
> Any clues from that output?
>
> I did not muck around with the Location tables.
>
> -Anthony

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

I see this in the old nodes

DEBUG [WRITE-/10.220.198.15] 2010-07-20 21:15:50,366 OutboundTcpConnection.java (line 142) attempting to connect to /10.220.198.15
INFO [GMFD:1] 2010-07-20 21:15:50,391 Gossiper.java (line 586) Node /10.220.198.15 is now part of the cluster
INFO [GMFD:1] 2010-07-20 21:15:51,369 Gossiper.java (line 578) InetAddress /10.220.198.15 is now UP
INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,369 HintedHandOffManager.java (line 153) Started hinted handoff for endPoint /10.220.198.15
INFO [HINTED-HANDOFF-POOL:1] 2010-07-20 21:15:51,371 HintedHandOffManager.java (line 210) Finished hinted handoff of 0 rows to endpoint /10.220.198.15
DEBUG [GMFD:1] 2010-07-20 21:17:20,551 StorageService.java (line 512) Node /10.220.198.15 state bootstrapping, token 28356863910078205288614550619314017621
DEBUG [GMFD:1] 2010-07-20 21:17:20,656 StorageService.java (line 746) Pending ranges:
/10.220.198.15:(21604748163853165203168832909938143241,28356863910078205288614550619314017621]
/10.220.198.15:(10637639655367601517656788464652024082,21604748163853165203168832909938143241]

10.220.198.15 is the new node

The key ranges seem to be for the primary and replica ranges.
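
The two pending ranges in the log are consistent with a replication factor
of 2 and ring-order replica placement, though both are assumptions inferred
from the output rather than stated in the thread. A minimal sketch of that
calculation, with a hypothetical helper name and the old-node tokens copied
from the ring listing earlier in the thread:

```python
# Tokens of the six old nodes, copied from the ring listing earlier in the thread.
OLD_TOKENS = [
    1620761242680682425026573496599110901,
    10637639655367601517656788464652024082,
    21604748163853165203168832909938143241,
    46182405069378676149148922496055212595,
    103509928471922053310251250943275708086,
    151901684708361811491018697633480111658,
]


def pending_ranges(new_token, existing_tokens, replication_factor=2):
    """Return the (start, end] ranges a node joining at `new_token` would hold.

    Assumptions: each node holds its own primary range plus the primary
    ranges of the (replication_factor - 1) nodes before it on the ring.
    """
    ring = sorted(set(existing_tokens) | {new_token})
    position = ring.index(new_token)
    ranges = []
    for k in range(replication_factor):
        end = ring[(position - k) % len(ring)]
        start = ring[(position - k - 1) % len(ring)]
        ranges.append((start, end))
    return ranges


if __name__ == "__main__":
    new_node_token = 28356863910078205288614550619314017621
    for start, end in pending_ranges(new_node_token, OLD_TOKENS):
        print(f"({start},{end}]")
```

With the new node's token from the log, this yields the same two ranges
that appear under "Pending ranges" above.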

So after that, I would expect some AntiCompaction to happen on some of the 
other nodes, but I don't see anything.

Any clues from that output?

I did not muck around with the Location tables.

-Anthony

On Mon, Jul 19, 2010 at 09:36:22PM -0500, Jonathan Ellis wrote:
> What gets logged on the old nodes at debug, when you try to add a
> single new machine after a full cluster restart?
> 
> Removing Location would blow away the nodes' token information...  It
> should be safe if you set the InitialToken to what it used to be on
> each machine before bringing it up after nuking those.  Better
> snapshot the system keyspace first, just in case.

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Jonathan Ellis <jb...@gmail.com>.

What gets logged on the old nodes at debug, when you try to add a
single new machine after a full cluster restart?

Removing Location would blow away the nodes' token information...  It
should be safe if you set the InitialToken to what it used to be on
each machine before bringing it up after nuking those.  Better
snapshot the system keyspace first, just in case.

On Sun, Jul 18, 2010 at 2:01 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> Yeah, I tried all that already and it didn't seem to work, no new nodes
> will bootstrap, which makes me think there's some saved state somewhere,
> preventing a new node from bootstrapping.  I think maybe the Location
> sstables?  Is it safe to nuke those on all hosts and restart everything?
> (I just don't want to lose actual data).
>
> Thanks for the ideas,
>
> -Anthony
>
> On Sun, Jul 18, 2010 at 08:09:45PM +0300, shimi wrote:
>> If I have problems with never-ending bootstrapping I do the following. I try
>> each one; if it doesn't help, I try the next. It might not be the right thing
>> to do but it worked for me.
>>
>> 1. Restart the bootstrapping node
>> 2. If I see streaming 0/xxxx I restart the node and all the streaming nodes
>> 3. Restart all the nodes
>> 4. If there is data in the bootstrapping node I delete it before I restart.
>>
>> Good luck
>> Shimi
>>
>> On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
>> anthonym@alumni.caltech.edu> wrote:
>>
>> > So still waiting for any sort of answer on this one.  The cluster still
>> > refuses to do anything when I bring up new nodes.  I shut down all the
>> > new nodes and am waiting.  I'm guessing that maybe the old nodes have
>> > some state which needs to get cleared out?  Is there anything I can do
>> > at this point?  Are there alternate strategies for bootstrapping I can
>> > try?  (For instance can I just scp all the sstables to all the new
>> > nodes and do a repair, would that actually work?).
>> >
>> > Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
>> > eventually others will see this issue.
>> >
>> > -Anthony
>> >
>> > On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
>> > > Okay, so things were pretty messed up.  I shut down all the new nodes,
>> > > then the old nodes started doing the half the ring is down garbage which
>> > > pretty much requires a full restart of everything.  So I had to shut
>> > > everything down, then bring the seed back, then the rest of the nodes,
>> > > so they finally all agreed on the ring again.
>> > >
>> > > Then I started one of the new nodes, and have been watching the logs, so
>> > > far 2 hours since the "Bootstrapping" message appeared in the new
>> > > log and nothing has happened.  No anticompaction messages anywhere,
>> > > there's one node compacting, but it's on the other end of the ring, so
>> > > nowhere near that new node.  I'm wondering if it will ever get data at
>> > > this point.
>> > >
>> > > Is there something else I should try?  The only thing I can think of
>> > > is deleting the system directory on the new node, and restarting, so
>> > > I'll try that and see if it does anything.
>> > >
>> > > -Anthony
>> > >
>> > > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
>> > > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
>> > > > <an...@alumni.caltech.edu> wrote:
>> > > > > Is the fact that 2 new nodes are in the range messing it up?
>> > > >
>> > > > Probably.
>> > > >
>> > > > >  And if so
>> > > > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, then
>> > > > > bringing up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
>> > > >
>> > > > Yes.
>> > > >
>> > > > You might have to restart the old nodes too to clear out the confusion.
>> > > >
>> > > > --
>> > > > Jonathan Ellis
>> > > > Project Chair, Apache Cassandra
>> > > > co-founder of Riptano, the source for professional Cassandra support
>> > > > http://riptano.com
>> > >
>> > > --
>> > > ------------------------------------------------------------------------
>> > > Anthony Molinaro                           <an...@alumni.caltech.edu>
>> >
>> > --
>> > ------------------------------------------------------------------------
>> > Anthony Molinaro                           <an...@alumni.caltech.edu>
>> >
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Yeah, I tried all that already and it didn't seem to work.  No new nodes
will bootstrap, which makes me think there's some saved state somewhere
preventing a new node from bootstrapping.  I think maybe the Location
sstables?  Is it safe to nuke those on all hosts and restart everything?
(I just don't want to lose actual data.)

Thanks for the ideas,

-Anthony

On Sun, Jul 18, 2010 at 08:09:45PM +0300, shimi wrote:
> If I have problems with never ending bootstraping I do the following. I try
> each one if it doesn't help I try the next. It might not be the right thing
> to do but it worked for me.
> 
> 1. Restart the bootstraping node
> 2. If I see streaming 0/xxxx I restart the node and all the streaming nodes
> 3. Restart all the nodes
> 4. If there is data in the bootstraing node I delete it before I restart.
> 
> Good luck
> Shimi
> 
> On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
> anthonym@alumni.caltech.edu> wrote:
> 
> > So still waiting for any sort of answer on this one.  The cluster still
> > refuses to do anything when I bring up new nodes.  I shut down all the
> > new nodes and am waiting.  I'm guessing that maybe the old nodes have
> > some state which needs to get cleared out?  Is there anything I can do
> > at this point?  Are there alternate strategies for bootstrapping I can
> > try?  (For instance can I just scp all the sstables to all the new
> > nodes and do a repair, would that actually work?).
> >
> > Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
> > eventually others will see this issue.
> >
> > -Anthony
> >
> > On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> > > Okay, so things were pretty messed up.  I shut down all the new nodes,
> > > then the old nodes started doing the half the ring is down garbage which
> > > pretty much requires a full restart of everything.  So I had to shut
> > > everything down, then bring the seed back, then the rest of the nodes,
> > > so they finally all agreed on the ring again.
> > >
> > > Then I started one of the new nodes, and have been watching the logs, so
> > > far 2 hours since the "Bootstrapping" message appeared in the new
> > > > log and nothing has happened.  No anticompaction messages anywhere, there's
> > > > one node compacting, but its on the other end of the ring, so no where near
> > > > that new node.  I'm wondering if it will ever get data at this point.
> > >
> > > Is there something else I should try?  The only thing I can think of
> > > is deleting the system directory on the new node, and restarting, so
> > > I'll try that and see if it does anything.
> > >
> > > -Anthony
> > >
> > > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> > > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> > > > <an...@alumni.caltech.edu> wrote:
> > > > > Is the fact that 2 new nodes are in the range messing it up?
> > > >
> > > > Probably.
> > > >
> > > > >  And if so
> > > > > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, the bringing
> > > > > > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> > > >
> > > > Yes.
> > > >
> > > > You might have to restart the old nodes too to clear out the confusion.
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > Project Chair, Apache Cassandra
> > > > co-founder of Riptano, the source for professional Cassandra support
> > > > http://riptano.com
> > >
> > > --
> > > ------------------------------------------------------------------------
> > > Anthony Molinaro                           <an...@alumni.caltech.edu>
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro                           <an...@alumni.caltech.edu>
> >

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by shimi <sh...@gmail.com>.

If I have problems with never-ending bootstrapping I do the following. I try
each step in turn; if it doesn't help I try the next. It might not be the
right thing to do but it worked for me.

1. Restart the bootstrapping node
2. If I see streaming 0/xxxx I restart the node and all the streaming nodes
3. Restart all the nodes
4. If there is data in the bootstrapping node I delete it before I restart.

Good luck
Shimi

On Sun, Jul 18, 2010 at 12:21 AM, Anthony Molinaro <
anthonym@alumni.caltech.edu> wrote:

> So still waiting for any sort of answer on this one.  The cluster still
> refuses to do anything when I bring up new nodes.  I shut down all the
> new nodes and am waiting.  I'm guessing that maybe the old nodes have
> some state which needs to get cleared out?  Is there anything I can do
> at this point?  Are there alternate strategies for bootstrapping I can
> try?  (For instance can I just scp all the sstables to all the new
> nodes and do a repair, would that actually work?).
>
> Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
> eventually others will see this issue.
>
> -Anthony
>
> On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> > Okay, so things were pretty messed up.  I shut down all the new nodes,
> > then the old nodes started doing the half the ring is down garbage which
> > pretty much requires a full restart of everything.  So I had to shut
> > everything down, then bring the seed back, then the rest of the nodes,
> > so they finally all agreed on the ring again.
> >
> > Then I started one of the new nodes, and have been watching the logs, so
> > far 2 hours since the "Bootstrapping" message appeared in the new
> > > log and nothing has happened.  No anticompaction messages anywhere, there's
> > > one node compacting, but its on the other end of the ring, so no where near
> > > that new node.  I'm wondering if it will ever get data at this point.
> >
> > Is there something else I should try?  The only thing I can think of
> > is deleting the system directory on the new node, and restarting, so
> > I'll try that and see if it does anything.
> >
> > -Anthony
> >
> > On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> > > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> > > <an...@alumni.caltech.edu> wrote:
> > > > Is the fact that 2 new nodes are in the range messing it up?
> > >
> > > Probably.
> > >
> > > >  And if so
> > > > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, the bringing
> > > > > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> > >
> > > Yes.
> > >
> > > You might have to restart the old nodes too to clear out the confusion.
> > >
> > > --
> > > Jonathan Ellis
> > > Project Chair, Apache Cassandra
> > > co-founder of Riptano, the source for professional Cassandra support
> > > http://riptano.com
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro                           <an...@alumni.caltech.edu>
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>
>

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

So still waiting for any sort of answer on this one.  The cluster still
refuses to do anything when I bring up new nodes.  I shut down all the
new nodes and am waiting.  I'm guessing that maybe the old nodes have
some state which needs to get cleared out?  Is there anything I can do
at this point?  Are there alternate strategies for bootstrapping I can
try?  (For instance can I just scp all the sstables to all the new
nodes and do a repair, would that actually work?).

Anyone seen this sort of issue?  All this is with 0.6.3 so I assume
eventually others will see this issue.

-Anthony

On Thu, Jul 15, 2010 at 10:45:08PM -0700, Anthony Molinaro wrote:
> Okay, so things were pretty messed up.  I shut down all the new nodes,
> then the old nodes started doing the half the ring is down garbage which
> pretty much requires a full restart of everything.  So I had to shut
> everything down, then bring the seed back, then the rest of the nodes,
> so they finally all agreed on the ring again.
> 
> Then I started one of the new nodes, and have been watching the logs, so
> far 2 hours since the "Bootstrapping" message appeared in the new
> log and nothing has happened.  No anticompaction messages anywhere, there's
> one node compacting, but its on the other end of the ring, so no where near
> that new node.  I'm wondering if it will ever get data at this point.
> 
> Is there something else I should try?  The only thing I can think of
> is deleting the system directory on the new node, and restarting, so
> I'll try that and see if it does anything.
> 
> -Anthony
> 
> On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> > On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> > <an...@alumni.caltech.edu> wrote:
> > > Is the fact that 2 new nodes are in the range messing it up?
> > 
> > Probably.
> > 
> > >  And if so
> > > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, the bringing
> > > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> > 
> > Yes.
> > 
> > You might have to restart the old nodes too to clear out the confusion.
> > 
> > -- 
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of Riptano, the source for professional Cassandra support
> > http://riptano.com
> 
> -- 
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

Okay, so things were pretty messed up.  I shut down all the new nodes,
then the old nodes started doing the "half the ring is down" garbage which
pretty much requires a full restart of everything.  So I had to shut
everything down, then bring the seed back, then the rest of the nodes,
so they finally all agreed on the ring again.

Then I started one of the new nodes and have been watching the logs.  So
far it's been 2 hours since the "Bootstrapping" message appeared in the new
node's log and nothing has happened.  No anticompaction messages anywhere.
There's one node compacting, but it's on the other end of the ring, so
nowhere near that new node.  I'm wondering if it will ever get data at
this point.

Is there something else I should try?  The only thing I can think of
is deleting the system directory on the new node, and restarting, so
I'll try that and see if it does anything.

-Anthony

On Thu, Jul 15, 2010 at 03:43:49PM -0500, Jonathan Ellis wrote:
> On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> > Is the fact that 2 new nodes are in the range messing it up?
> 
> Probably.
> 
> >  And if so
> > how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, the bringing
> > up nodes 2,4, waiting for them to finish, then bringing up 3,5?).
> 
> Yes.
> 
> You might have to restart the old nodes too to clear out the confusion.
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Jonathan Ellis <jb...@gmail.com>.

On Thu, Jul 15, 2010 at 3:28 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> Is the fact that 2 new nodes are in the range messing it up?

Probably.

>  And if so
> how do I recover (I'm thinking, shutdown new nodes 2,3,4,5, the bringing
> up nodes 2,4, waiting for them to finish, then bringing up 3,5?).

Yes.

You might have to restart the old nodes too to clear out the confusion.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Bootstrap question

Posted by Anthony Molinaro <an...@alumni.caltech.edu>.

This is a cluster which is horribly imbalanced because I didn't assign
initial tokens, so I'm adding 6 nodes with tokens according to the operations
page (ie, i * (2^127/N) with N = 6).
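
For reference, the formula above can be sketched in a few lines of Python.
This is only an illustration: the helper name `initial_tokens` is made up
here, and the use of integer floor division is an assumption, so the last
digit of a token may differ by one from the values in the ring listing.

```python
# Sketch: evenly spaced tokens for N nodes, i * (2^127 / N).
# `initial_tokens` is a hypothetical helper name, not part of Cassandra.

RING_SIZE = 2 ** 127  # size of the token space used by the formula above


def initial_tokens(n):
    """Return the tokens for i = 1..n using exact integer arithmetic."""
    return [(i * RING_SIZE) // n for i in range(1, n + 1)]


if __name__ == "__main__":
    for i, token in enumerate(initial_tokens(6), start=1):
        print("New %d  %d" % (i, token))
```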

So here's what the ring will look like when bootstrap finishes

                     151901684708361811491018697633480111658
   Old 1  673.76 GB    1620761242680682425026573496599110901
   Old 2  204.90 GB   10637639655367601517656788464652024082
   Old 3  139.82 GB   21604748163853165203168832909938143241
   New 1              28356863910078205288614550619314017621
   Old 4  250.61 GB   46182405069378676149148922496055212595
   New 2              56713727820156410577229101238628035242
   New 3              85070591730234615865843651857942052863
   Old 5  572.91 GB  103509928471922053310251250943275708086
   New 4             113427455640312821154458202477256070485
   New 5             141784319550391026443072753096570088106
   Old 6  739.61 GB  151901684708361811491018697633480111658
   New 6             170141183460469231731687303715884105728

So from this it seems like I should see anti-compaction on old nodes
4, 5, 6 and 1.
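
A rough way to check that reasoning, using the tokens from the ring listing
above.  This sketch assumes the simple model where each node owns the range
(previous token, its own token] and the lowest-token node also takes the
wrap-around range; it ignores replication and any interaction between nodes
bootstrapping at the same time.  The name `range_owner` is made up for
illustration.

```python
import bisect

# Tokens of the existing nodes, copied from the ring listing above.
OLD_NODES = {
    "Old 1": 1620761242680682425026573496599110901,
    "Old 2": 10637639655367601517656788464652024082,
    "Old 3": 21604748163853165203168832909938143241,
    "Old 4": 46182405069378676149148922496055212595,
    "Old 5": 103509928471922053310251250943275708086,
    "Old 6": 151901684708361811491018697633480111658,
}

# Tokens assigned to the new nodes, also from the listing above.
NEW_NODES = {
    "New 1": 28356863910078205288614550619314017621,
    "New 2": 56713727820156410577229101238628035242,
    "New 3": 85070591730234615865843651857942052863,
    "New 4": 113427455640312821154458202477256070485,
    "New 5": 141784319550391026443072753096570088106,
    "New 6": 170141183460469231731687303715884105728,
}


def range_owner(token, nodes):
    """Return the name of the node whose range contains `token`.

    Assumption: a node owns (previous token, own token], and a token
    larger than every node token wraps around to the lowest node.
    """
    ordered = sorted(nodes.items(), key=lambda item: item[1])
    tokens = [t for _, t in ordered]
    index = bisect.bisect_left(tokens, token)
    return ordered[index % len(ordered)][0]


if __name__ == "__main__":
    for name in sorted(NEW_NODES):
        print(name, "->", range_owner(NEW_NODES[name], OLD_NODES))
```

Under that model the new tokens fall in ranges owned by old nodes 4, 5, 6
and 1, which matches the list above.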

Looking now, it seems that 1 and 6 have had some anti-compaction
happen, while node 4 has

 INFO [STREAM-STAGE:1] 2010-07-14 20:53:26,579 StreamOut.java (line 95)
  Performing anticompaction ...

in the log but not a corresponding

CompactionManager.java (line 339) AntiCompacting [..]

line

Node 5 has nothing in its logs about anti-compaction.
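
The check I'm doing by hand on each old node could be sketched roughly like
this.  It only looks for the two message fragments quoted above; the
function name `anticompaction_status` and the status strings are made up
for illustration, and how the log text gets read from each node is left
out.

```python
# Sketch: classify a node's log by the two anticompaction messages above.
# `anticompaction_status` is a hypothetical helper, not a Cassandra tool.

REQUESTED = "Performing anticompaction"  # logged by StreamOut.java
STARTED = "AntiCompacting"               # logged by CompactionManager.java


def anticompaction_status(log_text):
    """Return 'started', 'requested only' or 'none' for one node's log."""
    if STARTED in log_text:
        return "started"
    if REQUESTED in log_text:
        return "requested only"
    return "none"


if __name__ == "__main__":
    sample = (
        " INFO [STREAM-STAGE:1] 2010-07-14 20:53:26,579 "
        "StreamOut.java (line 95) Performing anticompaction ...\n"
    )
    print(anticompaction_status(sample))
```

With the logs described above, node 4 would come out as "requested only"
and node 5 as "none".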

Is the fact that 2 new nodes are in the range messing it up?  And if so
how do I recover (I'm thinking, shut down new nodes 2,3,4,5, then bring
up nodes 2,4, wait for them to finish, then bring up 3,5?).

-Anthony


On Wed, Jul 14, 2010 at 08:45:45PM -0500, Jonathan Ellis wrote:
> Each node logs what token it is going to bootstrap to.  Who owns the
> ranges that contain those tokens?
> 
> On Wed, Jul 14, 2010 at 5:58 PM, Anthony Molinaro
> <an...@alumni.caltech.edu> wrote:
> > Hi,
> >
> >  I have a 0.6.3 cluster which contains 6 nodes.  I added 6 new nodes
> > by setting AutoBootstrap to true and setting an InitialToken on each new
> > node, then waiting for the "Bootstrapping" message in the log before
> > starting another.  Then I've been watching the logs on the old boxes
> > waiting to see AntiCompaction messages.
> >
> > Unfortunately after several hours I only see 1 of the 6 old nodes has
> > the AntiCompaction message.  The new nodes are placed such that every
> > old node should have some data pulled from it.  Why don't I see more
> > Anti Compaction messages?  Are there other things I should be looking
> > at?
> >
> > Thanks,
> >
> > -Anthony
> >
> > --
> > ------------------------------------------------------------------------
> > Anthony Molinaro                           <an...@alumni.caltech.edu>
> >
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <an...@alumni.caltech.edu>

Re: Bootstrap question

Posted by Jonathan Ellis <jb...@gmail.com>.

Each node logs what token it is going to bootstrap to.  Who owns the
ranges that contain those tokens?

On Wed, Jul 14, 2010 at 5:58 PM, Anthony Molinaro
<an...@alumni.caltech.edu> wrote:
> Hi,
>
>  I have a 0.6.3 cluster which contains 6 nodes.  I added 6 new nodes
> by setting AutoBootstrap to true and setting an InitialToken on each new
> node, then waiting for the "Bootstrapping" message in the log before
> starting another.  Then I've been watching the logs on the old boxes
> waiting to see AntiCompaction messages.
>
> Unfortunately after several hours I only see 1 of the 6 old nodes has
> the AntiCompaction message.  The new nodes are placed such that every
> old node should have some data pulled from it.  Why don't I see more
> Anti Compaction messages?  Are there other things I should be looking
> at?
>
> Thanks,
>
> -Anthony
>
> --
> ------------------------------------------------------------------------
> Anthony Molinaro                           <an...@alumni.caltech.edu>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com