Posted to user@cassandra.apache.org by TSANG Yiu Wing <yw...@gmail.com> on 2011/02/07 08:51:32 UTC

seed node failure crash the whole cluster

cassandra version: 0.7

client library: scale7-pelops / 1.0-RC1-0.7.0-SNAPSHOT

cluster: 3 machines (A, B, C)

details:
it works perfectly when all 3 machines are up and running

but if the seed machine is down, the problems happen:

1) new client connection cannot be established

2) if a client is already connected and operating on the cluster
(issuing gets and updates) when the seed goes down, that client will
throw an exception on its next operation

3) when using cassandra-cli to connect to the remaining nodes in the
cluster, "Internal error processing get_range_slices" is returned when
listing a column family
> list <cf>;


so i would like to know if the situation described above is normal or not?

if yes, does that mean the seed node is the single point of failure?

wing

Re: seed node failure crash the whole cluster

Posted by TSANG Yiu Wing <yw...@gmail.com>.

i will continue the issue here:

http://groups.google.com/group/scale7/browse_thread/thread/dd74f1d6265ae2e7

thanks


On Tue, Feb 8, 2011 at 7:44 AM, Dan Washusen <da...@reactive.org> wrote:
> Hi,
> I've added some comments and questions inline.
>
> Cheers,
> Dan
> On 8 February 2011 10:00, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Mon, Feb 7, 2011 at 1:51 AM, TSANG Yiu Wing <yw...@gmail.com> wrote:
>> > cassandra version: 0.7
>> >
>> > client library: scale7-pelops / 1.0-RC1-0.7.0-SNAPSHOT
>> >
>> > cluster: 3 machines (A, B, C)
>> >
>> > details:
>> > it works perfectly when all 3 machines are up and running
>> >
>> > but if the seed machine is down, the problems happen:
>> >
>> > 1) new client connection cannot be established
>>
>> sounds like pelops relies on the seed node to introduce it to the
>> cluster.  you should configure it either with a hardcoded list of
>> nodes or use something like RRDNS instead.  I don't use pelops so I
>> can't help other than that.  (I believe there is a mailing list for
>> Pelops though.)
>
> When dynamic node discovery is turned on (off by default) it doesn't
> (shouldn't) rely on the initial seed node once past initialization.  So
> either make sure you have dynamic node discovery turned on or seed Pelops
> with all nodes in your cluster...
> It would be helpful if you provided more information about the errors you're
> seeing, preferably with debug level logging turned on.
>
>>
>> > 2) if a client keeps connecting to and operating at (issue get and
>> > update) the cluster, when the seed is down, the working client will
>> > throw exception upon the next operation
>>
>> I know Hector supports transparent failover to another Cassandra node.
>>  Perhaps Pelops does not.
>
> Pelops will validate connections at a configurable period (60 seconds by
> default) and remove them from the pool.  Pelops will also retry the
> operation three times (configurable) against a different node in the pool
> each time.
> If you want Pelops to take more aggressive actions when it detects downed
> nodes then check out
> org.scale7.cassandra.pelops.pool.CommonsBackedPool.INodeSuspensionStrategy.
>
>>
>> > 3) using cassandra-cli to connect the remaining nodes in the cluster,
>> > "Internal error processing get_range_slices" will happen when querying
>> > column family
>> > > list <cf>;
>>
>> Cassandra always logs the cause of internal errors in system.log, so
>> you should look there.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
>

Re: seed node failure crash the whole cluster

Posted by Dan Washusen <da...@reactive.org>.

Hi,
I've added some comments and questions inline.

Cheers,
Dan

On 8 February 2011 10:00, Jonathan Ellis <jb...@gmail.com> wrote:

> On Mon, Feb 7, 2011 at 1:51 AM, TSANG Yiu Wing <yw...@gmail.com> wrote:
> > cassandra version: 0.7
> >
> > client library: scale7-pelops / 1.0-RC1-0.7.0-SNAPSHOT
> >
> > cluster: 3 machines (A, B, C)
> >
> > details:
> > it works perfectly when all 3 machines are up and running
> >
> > but if the seed machine is down, the problems happen:
> >
> > 1) new client connection cannot be established
>
> sounds like pelops relies on the seed node to introduce it to the
> cluster.  you should configure it either with a hardcoded list of
> nodes or use something like RRDNS instead.  I don't use pelops so I
> can't help other than that.  (I believe there is a mailing list for
> Pelops though.)
>

When dynamic node discovery is turned on (off by default) it doesn't
(shouldn't) rely on the initial seed node once past initialization.  So
either make sure you have dynamic node discovery turned on or seed Pelops
with all nodes in your cluster...
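
To illustrate why seeding the client with every node matters, here is a minimal, self-contained sketch. It is not the Pelops API: the class ContactPoints, the method firstReachable, and the probe predicate are hypothetical names made up for this example, and the probe stands in for whatever connection check a real client would perform.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative only: a generic contact-point picker, not part of Pelops or Cassandra. */
final class ContactPoints {
    private ContactPoints() {}

    /**
     * Returns the first host that the probe reports as reachable, trying the
     * hosts in the order given. An empty result means every host was down.
     */
    static Optional<String> firstReachable(List<String> hosts, Predicate<String> isReachable) {
        for (String host : hosts) {
            if (isReachable.test(host)) {
                return Optional.of(host);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumption for the demo: machine A (the seed) is down, B and C are up.
        Predicate<String> probe = host -> !host.equals("A");

        // Client seeded with A only: there is nothing to fall back on.
        System.out.println(firstReachable(Arrays.asList("A"), probe));           // prints Optional.empty
        // Client seeded with every node: it falls through to B.
        System.out.println(firstReachable(Arrays.asList("A", "B", "C"), probe)); // prints Optional[B]
    }
}
```

With only A configured, losing A leaves a new client with no way into the cluster, which would match symptom 1. With all three machines configured, the same failure is absorbed by the next entry in the list.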

It would be helpful if you provided more information about the errors you're
seeing, preferably with debug level logging turned on.

>
> > 2) if a client keeps connecting to and operating at (issue get and
> > update) the cluster, when the seed is down, the working client will
> > throw exception upon the next operation
>
> I know Hector supports transparent failover to another Cassandra node.
>  Perhaps Pelops does not.
>

Pelops will validate connections at a configurable period (60 seconds by
default) and remove them from the pool.  Pelops will also retry the
operation three times (configurable) against a different node in the pool
each time.
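
The retry behaviour described above can be sketched roughly as follows. This is a generic illustration under stated assumptions, not Pelops code: RetrySketch and withRetries are hypothetical names, the node list stands in for the connection pool, and how Pelops itself counts attempts or picks the next node is not modelled here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: retry an operation against a different node on each attempt. */
final class RetrySketch {
    private RetrySketch() {}

    /**
     * Runs the operation against up to maxAttempts distinct nodes, moving on to
     * the next node whenever an attempt throws. If every attempt fails, the
     * last failure is rethrown to the caller.
     */
    static <T> T withRetries(List<String> nodes, int maxAttempts, Function<String, T> operation) {
        if (nodes.isEmpty() || maxAttempts < 1) {
            throw new IllegalArgumentException("need at least one node and one attempt");
        }
        RuntimeException last = null;
        int attempts = Math.min(maxAttempts, nodes.size());
        for (int i = 0; i < attempts; i++) {
            try {
                return operation.apply(nodes.get(i));
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next node
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("A", "B", "C");
        // Assumption for the demo: node A is down, so the first attempt fails.
        String result = withRetries(nodes, 3, node -> {
            if (node.equals("A")) {
                throw new IllegalStateException("node A is down");
            }
            return "served by " + node;
        });
        System.out.println(result); // prints: served by B
    }
}
```

The caller only sees an exception when every attempted node fails. A client whose pool contains only the seed has no other node to retry against, so the first failure surfaces immediately.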

If you want Pelops to take more aggressive actions when it detects downed
nodes then check out
org.scale7.cassandra.pelops.pool.CommonsBackedPool.INodeSuspensionStrategy.

>
> > 3) using cassandra-cli to connect the remaining nodes in the cluster,
> > "Internal error processing get_range_slices" will happen when querying
> > column family
> > > list <cf>;
>
> Cassandra always logs the cause of internal errors in system.log, so
> you should look there.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: seed node failure crash the whole cluster

Posted by Jonathan Ellis <jb...@gmail.com>.

On Mon, Feb 7, 2011 at 1:51 AM, TSANG Yiu Wing <yw...@gmail.com> wrote:
> cassandra version: 0.7
>
> client library: scale7-pelops / 1.0-RC1-0.7.0-SNAPSHOT
>
> cluster: 3 machines (A, B, C)
>
> details:
> it works perfectly when all 3 machines are up and running
>
> but if the seed machine is down, the problems happen:
>
> 1) new client connection cannot be established

sounds like pelops relies on the seed node to introduce it to the
cluster.  you should configure it either with a hardcoded list of
nodes or use something like RRDNS instead.  I don't use pelops so I
can't help other than that.  (I believe there is a mailing list for
Pelops though.)

> 2) if a client keeps connecting to and operating at (issue get and
> update) the cluster, when the seed is down, the working client will
> throw exception upon the next operation

I know Hector supports transparent failover to another Cassandra node.
 Perhaps Pelops does not.

> 3) using cassandra-cli to connect the remaining nodes in the cluster,
> "Internal error processing get_range_slices" will happen when querying
> column family
> > list <cf>;

Cassandra always logs the cause of internal errors in system.log, so
you should look there.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com