Posted to user@cassandra.apache.org by James Golick <ja...@gmail.com> on 2010/03/27 18:29:32 UTC

Nodes Timing Out

Hey,

I put our first cluster into production (writing but not reading) a couple
of days ago. Right now, it has two pretty sizeable nodes taking about 200
writes per second each and virtually no reads.

Eventually, though (and this has happened twice), both nodes seem to start
timing out. If I run nodetool cfstats, I get:

[james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h cassandra1.fetlife.com cfstats
Keyspace: system
        Read Count: 39
        Read Latency: 0.35925641025641025 ms.
        Write Count: 3
        Write Latency: 0.166 ms.
        Pending Tasks: 66
                Column Family: HintsColumnFamily
                SSTable count: 0
                Space used (live): 0
                Space used (total): 0

and then it just hangs there.

Any ideas?

- James

Re: Nodes Timing Out

Posted by James Golick <ja...@gmail.com>.

Oops, I was running plain ulimit, which is what returned unlimited. ulimit -n returns 1024.
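
The two answers differ because, in bash, a bare ulimit reports the maximum file size (the same as ulimit -f), while ulimit -n reports the maximum number of open file descriptors. A minimal Python sketch of that distinction follows. It reads the limits of the process that runs it, which are inherited from the launching shell, so it only reflects the Cassandra JVM's limits if both were started from the same environment; the helper names are made up for illustration. Whether the 1024 limit caused the hang is not confirmed in this thread.

```python
import resource


def describe(limit):
    """Render a limit the way the shell does: 'unlimited' or a number."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def soft_limit(which):
    """Return the soft limit of one resource for the current process."""
    soft, _hard = resource.getrlimit(which)
    return soft


if __name__ == "__main__":
    # Bare `ulimit` in bash reports the maximum file size (like `ulimit -f`).
    # bash prints that value in blocks; getrlimit returns it in bytes.
    print("max file size :", describe(soft_limit(resource.RLIMIT_FSIZE)))
    # `ulimit -n` reports the maximum number of open file descriptors.
    print("max open files:", describe(soft_limit(resource.RLIMIT_NOFILE)))
```

On Linux, the limits of an already running process can also be read from /proc/<pid>/limits, which avoids guessing which environment the daemon was started from.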

On Sun, Mar 28, 2010 at 3:25 AM, Benoit Perroud <be...@noisette.ch> wrote:

> ulimit -n returns you unlimited ?
>
>
> 2010/3/28 James Golick <ja...@gmail.com>:
> > unlimited
> >
> > On Sat, Mar 27, 2010 at 12:09 PM, Chris Goffinet <go...@digg.com> wrote:
> >>
> >> what's the ulimit set to?
> >> -Chris
> >> On Mar 27, 2010, at 10:29 AM, James Golick wrote:
> >>
> >> Hey,
> >> I put our first cluster in to production (writing but not reading) a
> >> couple of days ago. Right now, it's got two pretty sizeable nodes taking
> >> about 200 writes per second each and virtually no reads.
> >> Eventually, though, (and this has happened twice), both nodes seem to
> >> start timing out. If I run nodetool cfstats, I get:
> >> [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h
> >> cassandra1.fetlife.com cfstats
> >> Keyspace: system
> >>         Read Count: 39
> >>         Read Latency: 0.35925641025641025 ms.
> >>         Write Count: 3
> >>         Write Latency: 0.166 ms.
> >>         Pending Tasks: 66
> >>                 Column Family: HintsColumnFamily
> >>                 SSTable count: 0
> >>                 Space used (live): 0
> >>                 Space used (total): 0
> >> and then it just hangs there.
> >> Any ideas?
> >> - James
> >
> >
>

Re: Nodes Timing Out

Posted by Benoit Perroud <be...@noisette.ch>.

ulimit -n returns you unlimited ?


2010/3/28 James Golick <ja...@gmail.com>:
> unlimited
>
> On Sat, Mar 27, 2010 at 12:09 PM, Chris Goffinet <go...@digg.com> wrote:
>>
>> what's the ulimit set to?
>> -Chris
>> On Mar 27, 2010, at 10:29 AM, James Golick wrote:
>>
>> Hey,
>> I put our first cluster in to production (writing but not reading) a
>> couple of days ago. Right now, it's got two pretty sizeable nodes taking
>> about 200 writes per second each and virtually no reads.
>> Eventually, though, (and this has happened twice), both nodes seem to
>> start timing out. If I run nodetool cfstats, I get:
>> [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h
>> cassandra1.fetlife.com cfstats
>> Keyspace: system
>>         Read Count: 39
>>         Read Latency: 0.35925641025641025 ms.
>>         Write Count: 3
>>         Write Latency: 0.166 ms.
>>         Pending Tasks: 66
>>                 Column Family: HintsColumnFamily
>>                 SSTable count: 0
>>                 Space used (live): 0
>>                 Space used (total): 0
>> and then it just hangs there.
>> Any ideas?
>> - James
>
>

Re: Nodes Timing Out

Posted by James Golick <ja...@gmail.com>.

unlimited

On Sat, Mar 27, 2010 at 12:09 PM, Chris Goffinet <go...@digg.com> wrote:

> what's the ulimit set to?
>
> -Chris
>
> On Mar 27, 2010, at 10:29 AM, James Golick wrote:
>
> Hey,
>
> I put our first cluster in to production (writing but not reading) a couple
> of days ago. Right now, it's got two pretty sizeable nodes taking about 200
> writes per second each and virtually no reads.
>
> Eventually, though, (and this has happened twice), both nodes seem to start
> timing out. If I run nodetool cfstats, I get:
>
> [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h
> cassandra1.fetlife.com cfstats
> Keyspace: system
>         Read Count: 39
>         Read Latency: 0.35925641025641025 ms.
>         Write Count: 3
>         Write Latency: 0.166 ms.
>         Pending Tasks: 66
>                 Column Family: HintsColumnFamily
>                 SSTable count: 0
>                 Space used (live): 0
>                 Space used (total): 0
>
> and then it just hangs there.
>
> Any ideas?
>
> - James
>
>
>

Re: Nodes Timing Out

Posted by Chris Goffinet <go...@digg.com>.

what's the ulimit set to?

-Chris

On Mar 27, 2010, at 10:29 AM, James Golick wrote:

> Hey,
> 
> I put our first cluster in to production (writing but not reading) a couple of days ago. Right now, it's got two pretty sizeable nodes taking about 200 writes per second each and virtually no reads.
> 
> Eventually, though, (and this has happened twice), both nodes seem to start timing out. If I run nodetool cfstats, I get:
> 
> [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h cassandra1.fetlife.com cfstats
> Keyspace: system
>         Read Count: 39
>         Read Latency: 0.35925641025641025 ms.
>         Write Count: 3
>         Write Latency: 0.166 ms.
>         Pending Tasks: 66
>                 Column Family: HintsColumnFamily
>                 SSTable count: 0
>                 Space used (live): 0
>                 Space used (total): 0
> 
> and then it just hangs there.
> 
> Any ideas?
> 
> - James

Re: Cassandra cluster does not tolerate single node failure

Posted by Jonathan Ellis <jb...@gmail.com>.

This is a known problem with 0.5 that was addressed in 0.6.

On Wed, Apr 7, 2010 at 9:18 AM, Oleg Anastasjev <ol...@gmail.com> wrote:
> Hello,
>
> I am testing Cassandra cluster behavior in several failure scenarios, and
> I am stuck on the very first test: what happens if one node of the cluster
> becomes unavailable.
> I have four 4 GB nodes loaded with a write-mostly test. Normally it runs at a
> rate of about 12000 ops/second. Replication Factor is 2.
> After a while, I shut down node 4, and the whole cluster's performance drops
> to 60 (yes, two hundred times slower!) ops per second. I checked this on both
> 0.5.0 and 0.5.1.
>
> This is my ring after shutdown:
> Address       Status     Load          Range                                      Ring
>                                        127605887595351923798765477786913079293
> 62.85.54.46   Up         119.03 MB     0                                          |<--|
> 62.85.54.47   Up         118.76 MB     42535295865117307932921825928971026431     |   |
> 62.85.54.48   Up         103.95 MB     85070591730234615865843651857942052862     |   |
> 62.85.54.49   Down       0 bytes       127605887595351923798765477786913079293    |-->|
>
>
> After a bit of investigation, I found that 62.85.54.46 and 62.85.54.47
> started to starve in the row mutation stage:
> 46:
> ROW-MUTATION-STAGE               32       313        1875089
> 47:
> ROW-MUTATION-STAGE               32      3042        1872123
> but 48 is not:
> ROW-MUTATION-STAGE                0         0        1668532
>
> All these mutations go to HintsColumnFamily;
> cfstats shows activity in this CF only on nodes 46 and 47:
> Keyspace: system
>        Read Count: 0
>        Read Latency: NaN ms.
>        Write Count: 4953
>        Write Latency: 386.766 ms.
>        Pending Tasks: 0
>                Column Family: LocationInfo
>                Memtable Columns Count: 0
>                Memtable Data Size: 0
>                Memtable Switch Count: 1
>                Read Count: 0
>                Read Latency: NaN ms.
>                Write Count: 0
>                Write Latency: NaN ms.
>                Pending Tasks: 0
>
>                Column Family: HintsColumnFamily
>                Memtable Columns Count: 173506
>                Memtable Data Size: 1648344
>                Memtable Switch Count: 1
>                Read Count: 0
>                Read Latency: NaN ms.
>                Write Count: 4954
>                Write Latency: 387.473 ms.
>                Pending Tasks: 0
> Please note the enormously slow write latency.
>
> Interestingly, issuing the "nodeprobe flush system" command to nodes 46 and 47
> speeds up processing for a short period of time, but then it quickly drops back
> to 66 ops/second.
>
> I suspect that these nodes create a very large number of subcolumns in a
> supercolumn of the HintsColumnFamily CF in the memtable.
>
> What can I do to make a Cassandra cluster tolerate a single node failure better?
>
>
>
>
>
>

Cassandra cluster does not tolerate single node failure

Posted by Oleg Anastasjev <ol...@gmail.com>.

Hello,

I am testing Cassandra cluster behavior in several failure scenarios, and
I am stuck on the very first test: what happens if one node of the cluster
becomes unavailable.
I have four 4 GB nodes loaded with a write-mostly test. Normally it runs at a
rate of about 12000 ops/second. Replication Factor is 2.
After a while, I shut down node 4, and the whole cluster's performance drops
to 60 (yes, two hundred times slower!) ops per second. I checked this on both
0.5.0 and 0.5.1.

This is my ring after shutdown:
Address       Status     Load          Range                                      Ring
                                       127605887595351923798765477786913079293
62.85.54.46   Up         119.03 MB     0                                          |<--|
62.85.54.47   Up         118.76 MB     42535295865117307932921825928971026431     |   |
62.85.54.48   Up         103.95 MB     85070591730234615865843651857942052862     |   |
62.85.54.49   Down       0 bytes       127605887595351923798765477786913079293    |-->|


After a bit of investigation, I found that 62.85.54.46 and 62.85.54.47
started to starve in the row mutation stage:
46:
ROW-MUTATION-STAGE               32       313        1875089
47:
ROW-MUTATION-STAGE               32      3042        1872123
but 48 is not:
ROW-MUTATION-STAGE                0         0        1668532

All these mutations go to HintsColumnFamily;
cfstats shows activity in this CF only on nodes 46 and 47:
Keyspace: system
        Read Count: 0
        Read Latency: NaN ms.
        Write Count: 4953
        Write Latency: 386.766 ms.
        Pending Tasks: 0
                Column Family: LocationInfo
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 1
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 0
                Write Latency: NaN ms.
                Pending Tasks: 0

                Column Family: HintsColumnFamily
                Memtable Columns Count: 173506
                Memtable Data Size: 1648344
                Memtable Switch Count: 1
                Read Count: 0
                Read Latency: NaN ms.
                Write Count: 4954
                Write Latency: 387.473 ms.
                Pending Tasks: 0
Please note the enormously slow write latency.

Interestingly, issuing the "nodeprobe flush system" command to nodes 46 and 47
speeds up processing for a short period of time, but then it quickly drops back
to 66 ops/second.

I suspect that these nodes create a very large number of subcolumns in a
supercolumn of the HintsColumnFamily CF in the memtable.

What can I do to make a Cassandra cluster tolerate a single node failure better?
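
The ROW-MUTATION-STAGE lines in this message can be compared mechanically. The sketch below assumes the three numeric columns are the active, pending, and completed task counts, as in thread pool statistics output; the function names and the pending threshold of 100 are made up for illustration and are not a recommended value.

```python
def parse_stage_line(line):
    """Split one stage line into its name and three counters.

    Assumes the column order: name, active, pending, completed.
    """
    name, active, pending, completed = line.split()
    return {
        "name": name,
        "active": int(active),
        "pending": int(pending),
        "completed": int(completed),
    }


def is_backed_up(stage, max_pending=100):
    """Flag a stage whose pending queue exceeds a hypothetical threshold."""
    return stage["pending"] > max_pending


# Lines copied from the message, keyed by the last octet of each node address.
SAMPLES = {
    "46": "ROW-MUTATION-STAGE               32       313        1875089",
    "47": "ROW-MUTATION-STAGE               32      3042        1872123",
    "48": "ROW-MUTATION-STAGE                0         0        1668532",
}

if __name__ == "__main__":
    for node, line in SAMPLES.items():
        stage = parse_stage_line(line)
        status = "backed up" if is_backed_up(stage) else "ok"
        print(f"node {node}: pending={stage['pending']} -> {status}")
```

With these sample lines, nodes 46 and 47 are flagged and node 48 is not, which matches the observation that only the first two nodes are starving in the row mutation stage.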

Re: Nodes Timing Out

Posted by James Golick <ja...@gmail.com>.

Nothing in the log. No CPU activity.

I'll try to strace it and connect with jconsole next time it happens.

On Sat, Mar 27, 2010 at 11:09 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> anything interesting in the log?
>
> is there cpu activity?
>
> can you connect w/ jconsole?
>
> On Sat, Mar 27, 2010 at 12:29 PM, James Golick <ja...@gmail.com> wrote:
> > Hey,
> > I put our first cluster in to production (writing but not reading) a couple
> > of days ago. Right now, it's got two pretty sizeable nodes taking about 200
> > writes per second each and virtually no reads.
> > Eventually, though, (and this has happened twice), both nodes seem to start
> > timing out. If I run nodetool cfstats, I get:
> > [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h cassandra1.fetlife.com cfstats
> > Keyspace: system
> >         Read Count: 39
> >         Read Latency: 0.35925641025641025 ms.
> >         Write Count: 3
> >         Write Latency: 0.166 ms.
> >         Pending Tasks: 66
> >                 Column Family: HintsColumnFamily
> >                 SSTable count: 0
> >                 Space used (live): 0
> >                 Space used (total): 0
> > and then it just hangs there.
> > Any ideas?
> > - James
>

Re: Nodes Timing Out

Posted by Jonathan Ellis <jb...@gmail.com>.

anything interesting in the log?

is there cpu activity?

can you connect w/ jconsole?

On Sat, Mar 27, 2010 at 12:29 PM, James Golick <ja...@gmail.com> wrote:
> Hey,
> I put our first cluster in to production (writing but not reading) a couple
> of days ago. Right now, it's got two pretty sizeable nodes taking about 200
> writes per second each and virtually no reads.
> Eventually, though, (and this has happened twice), both nodes seem to start
> timing out. If I run nodetool cfstats, I get:
> [james@cassandra1 ~]# /opt/cassandra/bin/nodetool -h cassandra1.fetlife.com
> cfstats
> Keyspace: system
>         Read Count: 39
>         Read Latency: 0.35925641025641025 ms.
>         Write Count: 3
>         Write Latency: 0.166 ms.
>         Pending Tasks: 66
>                 Column Family: HintsColumnFamily
>                 SSTable count: 0
>                 Space used (live): 0
>                 Space used (total): 0
> and then it just hangs there.
> Any ideas?
> - James