Posted to user@cassandra.apache.org by Jason Harvey <al...@gmail.com> on 2011/04/12 18:30:56 UTC

pycassa timeouts resolved by killing a random node in the ring

Interesting issue this morning.

My apps started throwing a bunch of pycassa timeouts all of a sudden.
The ring looked perfect. No load issues anywhere, and no errors in the
logs.

The site was basically down, so I got desperate and whacked a random
node in the ring. As soon as gossip saw it go down, the timeouts went
away. Thinking that was kinda crazy, I started the node back up. As
soon as it rejoined the ring, pycassa started timing out again. I then
killed another random node, far away from the first node I killed, and
the timeouts stopped again. Started it back up, and the timeouts
started again when it rejoined the ring.

Repeated this process once more just to make sure I wasn't insane, and
the same result happened. Killing any single node, anywhere in the
ring, fixes my timeouts.

Actively able to repro this. I am having to just keep one node down
right now so the site doesn't break. Desperate for any suggestions or
advice on this.

Using pycassa 1.0.7. Timeout is set to 15 seconds, with 3 retries.
Reads and writes are in quorum. 27 nodes in the ring, with an RF of 3.
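
(For readers following along: a rough sketch of the quorum arithmetic behind these settings. The function names are hypothetical and not part of pycassa or Cassandra; the only fact relied on is that a quorum is a majority of the replicas, floor(RF / 2) + 1.)

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor, replicas_down):
    """True if enough replicas of a key are still up to satisfy QUORUM."""
    replicas_up = replication_factor - replicas_down
    return replicas_up >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3  # replication factor from the post above
    print("quorum size:", quorum(rf))
    print("quorum with 1 replica down:", quorum_available(rf, 1))
    print("quorum with 2 replicas down:", quorum_available(rf, 2))
```

With an RF of 3, a single down node still leaves two replicas for every key, so quorum requests can keep succeeding while one node is offline.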

Thanks,
Jason

Re: pycassa timeouts resolved by killing a random node in the ring

Posted by aaron morton <aa...@thelastpickle.com>.
First, let's check whether the timeouts are client side or server side.
 
What was the stack trace for the timeout error?
Were they (Python/Thrift) socket timeouts, or TimedOutExceptions raised by the Cassandra Thrift code?

Is it across all requests and clients, or, say, just reads?
Have you tried asking on http://groups.google.com/group/pycassa-discuss ?
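
(A minimal sketch of how the two cases could be told apart when logging errors. The helper name classify_timeout is hypothetical. It assumes a client-side timeout surfaces as Python's socket.timeout and a server-side one as an exception class named TimedOutException; the class is matched by name only so the sketch runs without the Cassandra Thrift bindings installed. A client library may wrap either exception in one of its own, in which case the stack trace still has to be inspected.)

```python
import socket


def classify_timeout(exc):
    """Return 'client', 'server', or 'other' for a caught exception."""
    # A socket-level timeout means the client gave up waiting for a reply.
    if isinstance(exc, socket.timeout):
        return "client"
    # Assumption: the server-side timeout is identified by its class name,
    # which avoids importing the Thrift-generated module in this sketch.
    if type(exc).__name__ == "TimedOutException":
        return "server"
    return "other"
```

Logging the result next to each failed request would show whether the errors are client-side socket timeouts or timeouts reported by a Cassandra node.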

Hope that helps. 
Aaron
