Posted to user@cassandra.apache.org by "B. Todd Burruss" <bb...@real.com> on 2010/01/15 21:39:29 UTC

something bizzare occured

i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.
I have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes
are running.  if i remove a node, then "puts" fail - doesn't matter
which host i'm connected to.  if i restart the node, then all goes back
to normal operation.

the obvious misunderstanding to me is that i have set W=ALL.  As I
understand it, this should mean that the data will be written to ALL the
replicas (RF=3) not all the nodes in the cluster.  is this a bug or a
misunderstanding?


i also see the following message upon restarting the node that i stopped
- is it a problem?


2010-01-15 12:27:30,892  WARN [MESSAGING-SERVICE-POOL:4] [TcpConnection.java:485] Exception was generated at : 01/15/2010 12:27:30 on thread MESSAGING-SERVICE-POOL:4
Reached an EOL or something bizzare occured. Reading from: /192.168.132.105 BufferSizeRemaining: 16
java.io.IOException: Reached an EOL or something bizzare occured. Reading from: /192.168.132.105 BufferSizeRemaining: 16
	at org.apache.cassandra.net.io.StartState.doRead(StartState.java:44)
	at org.apache.cassandra.net.io.ProtocolState.read(ProtocolState.java:39)
	at org.apache.cassandra.net.io.TcpReader.read(TcpReader.java:96)
	at org.apache.cassandra.net.TcpConnection$ReadWorkItem.run(TcpConnection.java:445)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)

2010-01-15 12:27:30,892  INFO [MESSAGING-SERVICE-POOL:4] [TcpConnection.java:315] Closing errored connection java.nio.channels.SocketChannel[connected local=/192.168.132.101:7000 remote=/192.168.132.105:40253]
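
For what it's worth, the replies below trace the failed puts to plain
consistency-level arithmetic rather than a bug. A minimal sketch of the
blocking rule (illustrative Python, not Cassandra's actual code; the
function names are made up for the example):

RF = 3  # replication factor from this thread

def required_acks(level, rf):
    # Acks the coordinator must collect for a write at a given level.
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

def write_succeeds(level, rf, live_replicas):
    # With too few live replicas, Cassandra rejects the write outright
    # (an UnavailableException) instead of waiting on a dead node.
    return live_replicas >= required_acks(level, rf)

print(write_succeeds("ALL", RF, live_replicas=2))     # False: the failed puts
print(write_succeeds("QUORUM", RF, live_replicas=2))  # True: quorum still works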


Re: something bizzare occured

Posted by Brandon Williams <dr...@gmail.com>.
On Sat, Jan 16, 2010 at 11:00 AM, Todd Burruss <bb...@real.com> wrote:

> do these patches work for the 0.5 branch?  they don't seem to be in the tip
> of the branch


They might, I've not tried.  However, 685 was deemed a large enough change
to apply to trunk only, not 0.5, which is why you don't see them.

-Brandon

RE: something bizzare occured

Posted by Todd Burruss <bb...@real.com>.
do these patches work for the 0.5 branch?  they don't seem to be in the tip of the branch

________________________________________
From: Todd Burruss
Sent: Friday, January 15, 2010 3:57 PM
To: cassandra-user@incubator.apache.org
Subject: Re: something bizzare occured

yes it does.  i'll get trunk and try again.

thx!

On Fri, 2010-01-15 at 15:50 -0800, Brandon Williams wrote:
> On Fri, Jan 15, 2010 at 5:43 PM, B. Todd Burruss <bb...@real.com> wrote:
>         so i changed to QUORUM and retested.  "puts" again work as
>         expected when a node is down.  thx!
>
>         however, the response time for puts went from about 5ms to
>         400ms because i took 1 of the 5 nodes out.  ROW-MUTATION-STAGE
>         pendings jumped into the 100's on one of the remaining nodes
>         and the WriteLatency for the column family on this node also
>         went thru the roof.
>
>         i added the server back and the performance immediately went
>         back to the way it was.
>
>         is cassandra trying to constantly connect to the downed
>         server?  or what might be causing the performance to drop so
>         dramatically?
>
> It sounds like you're running into:
> http://issues.apache.org/jira/browse/CASSANDRA-658
>
> -Brandon



Re: something bizzare occured

Posted by "B. Todd Burruss" <bb...@real.com>.
yes it does.  i'll get trunk and try again.

thx!

On Fri, 2010-01-15 at 15:50 -0800, Brandon Williams wrote:
> On Fri, Jan 15, 2010 at 5:43 PM, B. Todd Burruss <bb...@real.com> wrote:
>         so i changed to QUORUM and retested.  "puts" again work as
>         expected when a node is down.  thx!
>
>         however, the response time for puts went from about 5ms to
>         400ms because i took 1 of the 5 nodes out.  ROW-MUTATION-STAGE
>         pendings jumped into the 100's on one of the remaining nodes
>         and the WriteLatency for the column family on this node also
>         went thru the roof.
>
>         i added the server back and the performance immediately went
>         back to the way it was.
>
>         is cassandra trying to constantly connect to the downed
>         server?  or what might be causing the performance to drop so
>         dramatically?
>
> It sounds like you're running into:
> http://issues.apache.org/jira/browse/CASSANDRA-658
>
> -Brandon



Re: something bizzare occured

Posted by Brandon Williams <dr...@gmail.com>.
On Fri, Jan 15, 2010 at 5:43 PM, B. Todd Burruss <bb...@real.com> wrote:

> so i changed to QUORUM and retested.  "puts" again work as expected when
> a node is down.  thx!
>
> however, the response time for puts went from about 5ms to 400ms because
> i took 1 of the 5 nodes out.  ROW-MUTATION-STAGE pendings jumped into the
> 100's on one of the remaining nodes and the WriteLatency for the column
> family on this node also went thru the roof.
>
> i added the server back and the performance immediately went back to the
> way it was.
>
> is cassandra trying to constantly connect to the downed server?  or what
> might be causing the performance to drop so dramatically?
>
>
It sounds like you're running into:
http://issues.apache.org/jira/browse/CASSANDRA-658

-Brandon

Re: something bizzare occured

Posted by "B. Todd Burruss" <bb...@real.com>.
so i changed to QUORUM and retested.  "puts" again work as expected when
a node is down.  thx!

however, the response time for puts went from about 5ms to 400ms because
i took 1 of the 5 nodes out.  ROW-MUTATION-STAGE pendings jumped into the
100's on one of the remaining nodes and the WriteLatency for the column
family on this node also went thru the roof.

i added the server back and the performance immediately went back to the
way it was.

is cassandra trying to constantly connect to the downed server?  or what
might be causing the performance to drop so dramatically?

On Fri, 2010-01-15 at 13:20 -0800, Jonathan Ellis wrote:
> right
> 
> On Fri, Jan 15, 2010 at 3:13 PM, B. Todd Burruss <bb...@real.com> wrote:
> > so with a 5 node cluster, R=W=Q and RF=3, i can only lose one consecutive
> > node on the consistency "ring", correct?
> >
> >
> > On Fri, 2010-01-15 at 12:54 -0800, Jonathan Ellis wrote:
> >> it has to do w/ consistency guarantees:
> >> http://wiki.apache.org/cassandra/HintedHandoff
> >>
> >> use quorum reads and writes instead of ALL on writes if you need both
> >> consistency and availability
> >>
> >> -Jonathan
> >>
> >> On Fri, Jan 15, 2010 at 2:50 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> > that makes sense, but i have had trouble understanding why
> >> > hinted-handoff doesn't take care of it?  if not, how many nodes would i
> >> > need to prevent this?
> >> >
> >> > thx
> >> >
> >> >
> >> > On Fri, 2010-01-15 at 12:43 -0800, Jonathan Ellis wrote:
> >> >> On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> >> > i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
> >> >> > have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
> >> >> > running.  if i remove a node, then "puts" fail - doesn't matter which host
> >> >> > i'm connected to.  if i restart the node, then all goes back to normal
> >> >> > operation.
> >> >> >
> >> >> > the obvious misunderstanding to me is that i have set W=ALL.  As I
> >> >> > understand it, this should mean that the data will be written to ALL the
> >> >> > replicas (RF=3) not all the nodes in the cluster.
> >> >>
> >> >> Right, but if you take one of the nodes down then it is going to be
> >> >> one of the three replicas for 3/5 of your keys.  (Could be more
> >> >> depending on your partitioner and whether you balanced your nodes.)
> >> >>
> >> >> > i also see the following message upon restarting the node that i stopped -
> >> >> > is it a problem?
> >> >>
> >> >> No.
> >> >>
> >> >> -Jonathan
> >> >
> >> >
> >> >
> >
> >
> >



Re: something bizzare occured

Posted by Jonathan Ellis <jb...@gmail.com>.
right

On Fri, Jan 15, 2010 at 3:13 PM, B. Todd Burruss <bb...@real.com> wrote:
> so with a 5 node cluster, R=W=Q and RF=3, i can only lose one consecutive
> node on the consistency "ring", correct?
>
>
> On Fri, 2010-01-15 at 12:54 -0800, Jonathan Ellis wrote:
>> it has to do w/ consistency guarantees:
>> http://wiki.apache.org/cassandra/HintedHandoff
>>
>> use quorum reads and writes instead of ALL on writes if you need both
>> consistency and availability
>>
>> -Jonathan
>>
>> On Fri, Jan 15, 2010 at 2:50 PM, B. Todd Burruss <bb...@real.com> wrote:
>> > that makes sense, but i have had trouble understanding why
>> > hinted-handoff doesn't take care of it?  if not, how many nodes would i
>> > need to prevent this?
>> >
>> > thx
>> >
>> >
>> > On Fri, 2010-01-15 at 12:43 -0800, Jonathan Ellis wrote:
>> >> On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
>> >> > i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
>> >> > have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
>> >> > running.  if i remove a node, then "puts" fail - doesn't matter which host
>> >> > i'm connected to.  if i restart the node, then all goes back to normal
>> >> > operation.
>> >> >
>> >> > the obvious misunderstanding to me is that i have set W=ALL.  As I
>> >> > understand it, this should mean that the data will be written to ALL the
>> >> > replicas (RF=3) not all the nodes in the cluster.
>> >>
>> >> Right, but if you take one of the nodes down then it is going to be
>> >> one of the three replicas for 3/5 of your keys.  (Could be more
>> >> depending on your partitioner and whether you balanced your nodes.)
>> >>
>> >> > i also see the following message upon restarting the node that i stopped -
>> >> > is it a problem?
>> >>
>> >> No.
>> >>
>> >> -Jonathan
>> >
>> >
>> >
>
>
>
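
Todd's claim above, which Jonathan confirms, can be checked mechanically.
A small sketch (illustrative Python, not Cassandra's code), assuming a
balanced 5-node ring with RF=3 and replicas placed on the next RF-1
nodes clockwise, in the style of the 0.5-era rack-unaware strategy:

from itertools import combinations

N, RF = 5, 3
QUORUM = RF // 2 + 1  # 2 of 3 replicas

def replicas(range_idx):
    # Replica set for the token range owned by node `range_idx`.
    return {(range_idx + i) % N for i in range(RF)}

def quorum_ok(down):
    # Every range must keep at least QUORUM live replicas.
    return all(len(replicas(r) - down) >= QUORUM for r in range(N))

print(all(quorum_ok({d}) for d in range(N)))   # True: any one node may be down
print([p for p in combinations(range(N), 2)
       if quorum_ok(set(p))])                  # []: no pair of failures survives

Under that placement assumption the result is slightly stronger than
"one consecutive node": on a ring this small, any two simultaneous
failures overlap some replica set, so the safe number is one node down
at a time.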

Re: something bizzare occured

Posted by "B. Todd Burruss" <bb...@real.com>.
so with a 5 node cluster, R=W=Q and RF=3, i can only lose one consecutive
node on the consistency "ring", correct?


On Fri, 2010-01-15 at 12:54 -0800, Jonathan Ellis wrote:
> it has to do w/ consistency guarantees:
> http://wiki.apache.org/cassandra/HintedHandoff
> 
> use quorum reads and writes instead of ALL on writes if you need both
> consistency and availability
> 
> -Jonathan
> 
> On Fri, Jan 15, 2010 at 2:50 PM, B. Todd Burruss <bb...@real.com> wrote:
> > that makes sense, but i have had trouble understanding why
> > hinted-handoff doesn't take care of it?  if not, how many nodes would i
> > need to prevent this?
> >
> > thx
> >
> >
> > On Fri, 2010-01-15 at 12:43 -0800, Jonathan Ellis wrote:
> >> On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> > i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
> >> > have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
> >> > running.  if i remove a node, then "puts" fail - doesn't matter which host
> >> > i'm connected to.  if i restart the node, then all goes back to normal
> >> > operation.
> >> >
> >> > the obvious misunderstanding to me is that i have set W=ALL.  As I
> >> > understand it, this should mean that the data will be written to ALL the
> >> > replicas (RF=3) not all the nodes in the cluster.
> >>
> >> Right, but if you take one of the nodes down then it is going to be
> >> one of the three replicas for 3/5 of your keys.  (Could be more
> >> depending on your partitioner and whether you balanced your nodes.)
> >>
> >> > i also see the following message upon restarting the node that i stopped -
> >> > is it a problem?
> >>
> >> No.
> >>
> >> -Jonathan
> >
> >
> >



Re: something bizzare occured

Posted by Jonathan Ellis <jb...@gmail.com>.
it has to do w/ consistency guarantees:
http://wiki.apache.org/cassandra/HintedHandoff

use quorum reads and writes instead of ALL on writes if you need both
consistency and availability

-Jonathan

On Fri, Jan 15, 2010 at 2:50 PM, B. Todd Burruss <bb...@real.com> wrote:
> that makes sense, but i have had trouble understanding why
> hinted-handoff doesn't take care of it?  if not, how many nodes would i
> need to prevent this?
>
> thx
>
>
> On Fri, 2010-01-15 at 12:43 -0800, Jonathan Ellis wrote:
>> On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
>> > i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
>> > have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
>> > running.  if i remove a node, then "puts" fail - doesn't matter which host
>> > i'm connected to.  if i restart the node, then all goes back to normal
>> > operation.
>> >
>> > the obvious misunderstanding to me is that i have set W=ALL.  As I
>> > understand it, this should mean that the data will be written to ALL the
>> > replicas (RF=3) not all the nodes in the cluster.
>>
>> Right, but if you take one of the nodes down then it is going to be
>> one of the three replicas for 3/5 of your keys.  (Could be more
>> depending on your partitioner and whether you balanced your nodes.)
>>
>> > i also see the following message upon restarting the node that i stopped -
>> > is it a problem?
>>
>> No.
>>
>> -Jonathan
>
>
>
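
The consistency half of Jonathan's advice is the usual overlap rule: a
read is guaranteed to see the latest write whenever the read and write
replica sets must intersect, i.e. whenever R + W > RF. A quick sketch
(illustrative Python; the helper names are invented):

RF = 3

def overlap_guaranteed(r, w, rf=RF):
    # Read and write quorums must share at least one replica.
    return r + w > rf

def writes_tolerate_one_down(w, rf=RF):
    # A write can still gather enough acks with one replica down.
    return w <= rf - 1

for name, r, w in [("R=1, W=ALL", 1, 3), ("R=W=QUORUM", 2, 2)]:
    print(name,
          "| consistent:", overlap_guaranteed(r, w),
          "| writes survive one down replica:", writes_tolerate_one_down(w))

Both of Todd's configurations satisfy R + W > RF, but only R=W=QUORUM
keeps writes available while a replica is down, which is the trade
Jonathan is pointing at.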

Re: something bizzare occured

Posted by "B. Todd Burruss" <bb...@real.com>.
that makes sense, but i have had trouble understanding why
hinted-handoff doesn't take care of it?  if not, how many nodes would i
need to prevent this?

thx


On Fri, 2010-01-15 at 12:43 -0800, Jonathan Ellis wrote:
> On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
> > i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
> > have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
> > running.  if i remove a node, then "puts" fail - doesn't matter which host
> > i'm connected to.  if i restart the node, then all goes back to normal
> > operation.
> >
> > the obvious misunderstanding to me is that i have set W=ALL.  As I
> > understand it, this should mean that the data will be written to ALL the
> > replicas (RF=3) not all the nodes in the cluster.
> 
> Right, but if you take one of the nodes down then it is going to be
> one of the three replicas for 3/5 of your keys.  (Could be more
> depending on your partitioner and whether you balanced your nodes.)
> 
> > i also see the following message upon restarting the node that i stopped -
> > is it a problem?
> 
> No.
> 
> -Jonathan



Re: something bizzare occured

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jan 15, 2010 at 2:39 PM, B. Todd Burruss <bb...@real.com> wrote:
> i'm trying to understand why cassandra 0.5 RC3 is behaving like it is.  I
> have a 5 node cluster, RF=3, W=ALL, R=1.  all is well if all the nodes are
> running.  if i remove a node, then "puts" fail - doesn't matter which host
> i'm connected to.  if i restart the node, then all goes back to normal
> operation.
>
> the obvious misunderstanding to me is that i have set W=ALL.  As I
> understand it, this should mean that the data will be written to ALL the
> replicas (RF=3) not all the nodes in the cluster.

Right, but if you take one of the nodes down then it is going to be
one of the three replicas for 3/5 of your keys.  (Could be more
depending on your partitioner and whether you balanced your nodes.)

> i also see the following message upon restarting the node that i stopped -
> is it a problem?

No.

-Jonathan
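
Jonathan's "3/5 of your keys" figure follows directly from replica
placement. One last sketch (illustrative Python, not Cassandra's code),
again assuming a balanced ring with next-RF-1-clockwise placement:

N, RF = 5, 3

def replicas(range_idx):
    # Replica set for the token range owned by node `range_idx`.
    return {(range_idx + i) % N for i in range(RF)}

down = 0  # any single node; the balanced ring is symmetric
affected = sum(1 for r in range(N) if down in replicas(r))
print(affected, "of", N, "ranges")  # 3 of 5, so W=ALL puts fail for 3/5 of keys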