Posted to user@cassandra.apache.org by Ian Soboroff <is...@gmail.com> on 2010/05/18 15:24:30 UTC

Scaling problems

I hope this isn't too much of a newbie question.  I am using Cassandra 0.6.1
on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5 data
drives.  The nodes are running HDFS to serve files within the cluster, but
at the moment the rest of Hadoop is shut down.  I'm trying to load a large
set of web pages (the ClueWeb collection, but more is coming) and my
Cassandra daemons keep dying.

I'm loading the pages into a simple column family that lets me fetch pages by an internal ID or by URL.  The biggest thing in the row is the page content, maybe 15-20KB of raw HTML per page.  There aren't a lot of columns.  I tried Thrift, Hector, and the BMT interface; at the moment I'm doing batch mutations over Thrift, about 2500 pages per batch, because that was fastest for me in testing.
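
For concreteness, here is a rough sketch of what one of those batch mutations looks like against the 0.6 Thrift API.  The keyspace ("WebPages"), column family ("Pages"), column names, row key, and node name are placeholders rather than my actual schema, and error handling is omitted.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.Mutation;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class PageBatchSketch {
    // Wrap one column in the Mutation envelope that batch_mutate expects.
    static Mutation col(String name, byte[] value, long timestamp) {
        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(new Column(name.getBytes(), value, timestamp));
        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(cosc);
        return m;
    }

    public static void main(String[] args) throws Exception {
        // Talk to a single node at a time, as described above.
        TTransport transport = new TSocket("node1", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        long ts = System.currentTimeMillis() * 1000;  // client-supplied timestamp

        // row key -> (column family -> mutations); in the real loader this
        // map holds ~2500 pages before each batch_mutate call.
        Map<String, Map<String, List<Mutation>>> batch =
                new HashMap<String, Map<String, List<Mutation>>>();
        Map<String, List<Mutation>> byCf = new HashMap<String, List<Mutation>>();
        byCf.put("Pages", Arrays.asList(
                col("url", "http://example.com/".getBytes(), ts),
                col("content", "<html>...</html>".getBytes(), ts)));
        batch.put("clueweb-doc-000001", byCf);

        client.batch_mutate("WebPages", batch, ConsistencyLevel.ONE);
        transport.close();
    }
}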

At this point, each Cassandra node has between 500GB and 1.5TB according to
nodetool ring.  Let's say I start the daemons up, and they all go live after
a couple minutes of scanning the tables.  I then start my importer, which is
a single Java process reading Clueweb bundles over HDFS, cutting them up,
and sending the mutations to Cassandra.  I only talk to one node at a time,
switching to a new node when I get an exception.  As the job runs over a few
hours, the Cassandra daemons eventually fall over, either with no error in
the log or reporting that they are out of heap.

Each daemon is getting 6GB of RAM and has scads of disk space to play with.
In storage-conf.xml I've set memtables to flush at 256MB (as in the BMT
example), the commit log to batch sync mode, and no caching on the CFs.  I'm
sure I must be tuning something wrong.  I would eventually like this Cassandra
setup to serve a light request load over, say, 50-100 TB of data.  I'd
appreciate any help or advice you can offer.

Thanks,
Ian

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
OK, I'm spending some time slogging through the cassandra-user archives.  It
seems lots of folks have this problem.  I'm starting with a JVM upgrade, then
skimming through JIRA looking for patches.

Ian

On Fri, May 21, 2010 at 12:09 PM, Ian Soboroff <is...@gmail.com> wrote:

> So at the moment, I'm not running my loader, and I'm looking at one node
> which is slow to respond to nodetool requests.  At this point, it has a pile
> of hinted-handoffs pending which don't seem to be draining out.  The
> system.log shows that it's GCing pretty much constantly.
> Ian
>
>
> $ /usr/local/src/cassandra/bin/nodetool --host node7 tpstats
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            178
> STREAM-STAGE                      0         0              0
> RESPONSE-STAGE                    0         0          21852
> ROW-READ-STAGE                    0         0              0
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         0         0        1648536
> GMFD                              0         0         125430
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0              0
> ROW-MUTATION-STAGE                2         2        1886537
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            206
> FLUSH-WRITER-POOL                 0         0            206
> AE-SERVICE-STAGE                  0         0              0
> HINTED-HANDOFF-POOL               1       158             23
>
>
>
> On Fri, May 21, 2010 at 10:37 AM, Ian Soboroff <is...@gmail.com>wrote:
>
>> On the to-do list for today.  Is there a tool to aggregate all  the JMX
>> stats from all nodes?  I mean, something a little more complete than nagios.
>> Ian
>>
>>
>> On Fri, May 21, 2010 at 10:23 AM, Jonathan Ellis <jb...@gmail.com>wrote:
>>
>>> you should check the jmx stages I posted about
>>>
>>> On Fri, May 21, 2010 at 7:05 AM, Ian Soboroff <is...@gmail.com>
>>> wrote:
>>> > Just an update.  I rolled the memtable size back to 128MB.  I am still
>>> > seeing that the daemon runs for a while with reasonable heap usage, but
>>> then
>>> > the heap climbs up to the max (6GB in this case, should be plenty) and
>>> it
>>> > starts GCing, without much getting cleared.  The client catches lots of
>>> > exceptions, where I wait 30 seconds and try again, with a new client if
>>> > necessary, but it doesn't clear up.
>>> >
>>> > Could this be related to memory leak problems I've skimmed past on the
>>> list
>>> > here?
>>> >
>>> > It can't be that I'm creating rows a bit at a time... once I stick a
>>> web
>>> > page into two CFs, it's over and done with for this application.  I'm
>>> just
>>> > trying to get stuff loaded.
>>> >
>>> > Is there a limit to how much on-disk data a Cassandra daemon can
>>> manage?  Is
>>> > there runtime overhead associated with stuff on disk?
>>> >
>>> > Ian
>>> >
>>> > On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <is...@gmail.com>
>>> wrote:
>>> >>
>>> >> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I
>>> didn't
>>> >> realize that I was trying to float so many memtables.  I'll poke
>>> tomorrow
>>> >> and report if it gets fixed.
>>> >> Ian
>>> >>
>>> >> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> Some possibilities:
>>> >>>
>>> >>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
>>> >>> small)
>>> >>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
>>> >>> large pending ops -- large = 100s)
>>> >>> You're creating large rows a bit at a time and Cassandra OOMs when it
>>> >>> tries to compact (the oom should usually be in the compaction thread)
>>> >>> You have your 5 disks each with a separate data directory, which will
>>> >>> allow up to 12 total memtables in-flight internally, and 12*256 is
>>> too
>>> >>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
>>> >>> show large pending ops -- large = more than 2 or 3)
>>> >>>
>>> >>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com>
>>> >>> wrote:
>>> >>> > I hope this isn't too much of a newbie question.  I am using
>>> Cassandra
>>> >>> > 0.6.1
>>> >>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and
>>> 5
>>> >>> > data
>>> >>> > drives.  The nodes are running HDFS to serve files within the
>>> cluster,
>>> >>> > but
>>> >>> > at the moment the rest of Hadoop is shut down.  I'm trying to load
>>> a
>>> >>> > large
>>> >>> > set of web pages (the ClueWeb collection, but more is coming) and
>>> my
>>> >>> > Cassandra daemons keep dying.
>>> >>> >
>>> >>> > I'm loading the pages into a simple column family that lets me
>>> fetch
>>> >>> > out
>>> >>> > pages by an internal ID or by URL.  The biggest thing in the row is
>>> the
>>> >>> > page
>>> >>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
>>> >>> > columns.
>>> >>> > I tried Thrift, Hector, and the BMT interface, and at the moment
>>> I'm
>>> >>> > doing
>>> >>> > batch mutations over Thrift, about 2500 pages per batch, because
>>> that
>>> >>> > was
>>> >>> > fastest for me in testing.
>>> >>> >
>>> >>> > At this point, each Cassandra node has between 500GB and 1.5TB
>>> >>> > according to
>>> >>> > nodetool ring.  Let's say I start the daemons up, and they all go
>>> live
>>> >>> > after
>>> >>> > a couple minutes of scanning the tables.  I then start my importer,
>>> >>> > which is
>>> >>> > a single Java process reading Clueweb bundles over HDFS, cutting
>>> them
>>> >>> > up,
>>> >>> > and sending the mutations to Cassandra.  I only talk to one node at
>>> a
>>> >>> > time,
>>> >>> > switching to a new node when I get an exception.  As the job runs
>>> over
>>> >>> > a few
>>> >>> > hours, the Cassandra daemons eventually fall over, either with no
>>> error
>>> >>> > in
>>> >>> > the log or reporting that they are out of heap.
>>> >>> >
>>> >>> > Each daemon is getting 6GB of RAM and has scads of disk space to
>>> play
>>> >>> > with.
>>> >>> > I've set the storage-conf.xml to take 256MB in a memtable before
>>> >>> > flushing
>>> >>> > (like the BMT case), and to do batch commit log flushes, and to not
>>> >>> > have any
>>> >>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I
>>> >>> > would
>>> >>> > eventually like this Cassandra setup to serve a light request load
>>> but
>>> >>> > over
>>> >>> > say 50-100 TB of data.  I'd appreciate any help or advice you can
>>> >>> > offer.
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Ian
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Jonathan Ellis
>>> >>> Project Chair, Apache Cassandra
>>> >>> co-founder of Riptano, the source for professional Cassandra support
>>> >>> http://riptano.com
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>>>
>>
>>
>

Re: Scaling problems

Posted by Jonathan Ellis <jb...@gmail.com>.
No, it's really not designed for a "leave the nodes down while I do
a ton of inserts" workload.

(a) HH schema creates a column per hinted row, so you'll hit the 2GB
row limit sooner or later
(b) it goes through the hints hourly in case it missed a gossip Up notification

On Sat, May 22, 2010 at 9:07 PM, Ian Soboroff <is...@gmail.com> wrote:
> I'll try this.  HH backs up because nodes are failing.  I haven't read the
> code, but why should HH suck CPU?  As I understand it, there's nothing to
> hand off until the destination comes back up, and Gossip should tell us
> that, no?  In the interim, it's just a cache of writes waiting to be sent.
>
> Is there some way to tell the system "Just stop caring, I'm just writing,
> let's worry about leveling out when I get around to wanting to read?"
>
> Ian
>
> On Fri, May 21, 2010 at 9:06 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Fri, May 21, 2010 at 9:09 AM, Ian Soboroff <is...@gmail.com> wrote:
>> > HINTED-HANDOFF-POOL               1       158             23
>>
>> this is your smoking gun.  HH tasks suck a ton of CPU and you have 158
>> backed up.
>>
>> i would just blow the HH files away from your data/system directory,
>> restart the node, and run repair (assuming all your other nodes are
>> alive again).
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
I'll try this.  HH backs up because nodes are failing.  I haven't read the
code, but why should HH suck CPU?  As I understand it, there's nothing to
hand off until the destination comes back up, and Gossip should tell us
that, no?  In the interim, it's just a cache of writes waiting to be sent.

Is there some way to tell the system "Just stop caring, I'm just writing,
let's worry about leveling out when I get around to wanting to read?"

Ian

On Fri, May 21, 2010 at 9:06 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Fri, May 21, 2010 at 9:09 AM, Ian Soboroff <is...@gmail.com> wrote:
> > HINTED-HANDOFF-POOL               1       158             23
>
> this is your smoking gun.  HH tasks suck a ton of CPU and you have 158
> backed up.
>
> i would just blow the HH files away from your data/system directory,
> restart the node, and run repair (assuming all your other nodes are
> alive again).
>

Re: Scaling problems

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, May 21, 2010 at 9:09 AM, Ian Soboroff <is...@gmail.com> wrote:
> HINTED-HANDOFF-POOL               1       158             23

This is your smoking gun.  HH tasks suck a ton of CPU and you have 158
backed up.

I would just blow the HH files away from your data/system directory,
restart the node, and run repair (assuming all your other nodes are
alive again).
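
Concretely, on the affected node that amounts to something like the following
(the HintsColumnFamily-* file name pattern is an assumption about where 0.6
keeps its hint SSTables, and the data path is whatever you configured):

$ # stop the cassandra daemon on node7 first
$ rm <data-directory>/system/HintsColumnFamily-*
$ # restart the daemon, then rebuild the node's data from its replicas:
$ /usr/local/src/cassandra/bin/nodetool --host node7 repair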

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
So at the moment, I'm not running my loader, and I'm looking at one node
which is slow to respond to nodetool requests.  At this point, it has a pile
of hinted-handoffs pending which don't seem to be draining out.  The
system.log shows that it's GCing pretty much constantly.
Ian


$ /usr/local/src/cassandra/bin/nodetool --host node7 tpstats
Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            178
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0          21852
ROW-READ-STAGE                    0         0              0
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         0         0        1648536
GMFD                              0         0         125430
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                2         2        1886537
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            206
FLUSH-WRITER-POOL                 0         0            206
AE-SERVICE-STAGE                  0         0              0
HINTED-HANDOFF-POOL               1       158             23


On Fri, May 21, 2010 at 10:37 AM, Ian Soboroff <is...@gmail.com> wrote:

> On the to-do list for today.  Is there a tool to aggregate all  the JMX
> stats from all nodes?  I mean, something a little more complete than nagios.
> Ian
>
>
> On Fri, May 21, 2010 at 10:23 AM, Jonathan Ellis <jb...@gmail.com>wrote:
>
>> you should check the jmx stages I posted about
>>
>> On Fri, May 21, 2010 at 7:05 AM, Ian Soboroff <is...@gmail.com>
>> wrote:
>> > Just an update.  I rolled the memtable size back to 128MB.  I am still
>> > seeing that the daemon runs for a while with reasonable heap usage, but
>> then
>> > the heap climbs up to the max (6GB in this case, should be plenty) and
>> it
>> > starts GCing, without much getting cleared.  The client catches lots of
>> > exceptions, where I wait 30 seconds and try again, with a new client if
>> > necessary, but it doesn't clear up.
>> >
>> > Could this be related to memory leak problems I've skimmed past on the
>> list
>> > here?
>> >
>> > It can't be that I'm creating rows a bit at a time... once I stick a web
>> > page into two CFs, it's over and done with for this application.  I'm
>> just
>> > trying to get stuff loaded.
>> >
>> > Is there a limit to how much on-disk data a Cassandra daemon can
>> manage?  Is
>> > there runtime overhead associated with stuff on disk?
>> >
>> > Ian
>> >
>> > On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <is...@gmail.com>
>> wrote:
>> >>
>> >> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I
>> didn't
>> >> realize that I was trying to float so many memtables.  I'll poke
>> tomorrow
>> >> and report if it gets fixed.
>> >> Ian
>> >>
>> >> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Some possibilities:
>> >>>
>> >>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
>> >>> small)
>> >>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
>> >>> large pending ops -- large = 100s)
>> >>> You're creating large rows a bit at a time and Cassandra OOMs when it
>> >>> tries to compact (the oom should usually be in the compaction thread)
>> >>> You have your 5 disks each with a separate data directory, which will
>> >>> allow up to 12 total memtables in-flight internally, and 12*256 is too
>> >>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
>> >>> show large pending ops -- large = more than 2 or 3)
>> >>>
>> >>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com>
>> >>> wrote:
>> >>> > I hope this isn't too much of a newbie question.  I am using
>> Cassandra
>> >>> > 0.6.1
>> >>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and
>> 5
>> >>> > data
>> >>> > drives.  The nodes are running HDFS to serve files within the
>> cluster,
>> >>> > but
>> >>> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
>> >>> > large
>> >>> > set of web pages (the ClueWeb collection, but more is coming) and my
>> >>> > Cassandra daemons keep dying.
>> >>> >
>> >>> > I'm loading the pages into a simple column family that lets me fetch
>> >>> > out
>> >>> > pages by an internal ID or by URL.  The biggest thing in the row is
>> the
>> >>> > page
>> >>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
>> >>> > columns.
>> >>> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
>> >>> > doing
>> >>> > batch mutations over Thrift, about 2500 pages per batch, because
>> that
>> >>> > was
>> >>> > fastest for me in testing.
>> >>> >
>> >>> > At this point, each Cassandra node has between 500GB and 1.5TB
>> >>> > according to
>> >>> > nodetool ring.  Let's say I start the daemons up, and they all go
>> live
>> >>> > after
>> >>> > a couple minutes of scanning the tables.  I then start my importer,
>> >>> > which is
>> >>> > a single Java process reading Clueweb bundles over HDFS, cutting
>> them
>> >>> > up,
>> >>> > and sending the mutations to Cassandra.  I only talk to one node at
>> a
>> >>> > time,
>> >>> > switching to a new node when I get an exception.  As the job runs
>> over
>> >>> > a few
>> >>> > hours, the Cassandra daemons eventually fall over, either with no
>> error
>> >>> > in
>> >>> > the log or reporting that they are out of heap.
>> >>> >
>> >>> > Each daemon is getting 6GB of RAM and has scads of disk space to
>> play
>> >>> > with.
>> >>> > I've set the storage-conf.xml to take 256MB in a memtable before
>> >>> > flushing
>> >>> > (like the BMT case), and to do batch commit log flushes, and to not
>> >>> > have any
>> >>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I
>> >>> > would
>> >>> > eventually like this Cassandra setup to serve a light request load
>> but
>> >>> > over
>> >>> > say 50-100 TB of data.  I'd appreciate any help or advice you can
>> >>> > offer.
>> >>> >
>> >>> > Thanks,
>> >>> > Ian
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of Riptano, the source for professional Cassandra support
>> >>> http://riptano.com
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
On the to-do list for today.  Is there a tool to aggregate all the JMX
stats from all nodes?  I mean, something a little more complete than Nagios.
Ian

On Fri, May 21, 2010 at 10:23 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> you should check the jmx stages I posted about
>
> On Fri, May 21, 2010 at 7:05 AM, Ian Soboroff <is...@gmail.com> wrote:
> > Just an update.  I rolled the memtable size back to 128MB.  I am still
> > seeing that the daemon runs for a while with reasonable heap usage, but
> then
> > the heap climbs up to the max (6GB in this case, should be plenty) and it
> > starts GCing, without much getting cleared.  The client catches lots of
> > exceptions, where I wait 30 seconds and try again, with a new client if
> > necessary, but it doesn't clear up.
> >
> > Could this be related to memory leak problems I've skimmed past on the
> list
> > here?
> >
> > It can't be that I'm creating rows a bit at a time... once I stick a web
> > page into two CFs, it's over and done with for this application.  I'm
> just
> > trying to get stuff loaded.
> >
> > Is there a limit to how much on-disk data a Cassandra daemon can manage?
> Is
> > there runtime overhead associated with stuff on disk?
> >
> > Ian
> >
> > On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <is...@gmail.com>
> wrote:
> >>
> >> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I
> didn't
> >> realize that I was trying to float so many memtables.  I'll poke
> tomorrow
> >> and report if it gets fixed.
> >> Ian
> >>
> >> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com>
> >> wrote:
> >>>
> >>> Some possibilities:
> >>>
> >>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
> >>> small)
> >>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
> >>> large pending ops -- large = 100s)
> >>> You're creating large rows a bit at a time and Cassandra OOMs when it
> >>> tries to compact (the oom should usually be in the compaction thread)
> >>> You have your 5 disks each with a separate data directory, which will
> >>> allow up to 12 total memtables in-flight internally, and 12*256 is too
> >>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
> >>> show large pending ops -- large = more than 2 or 3)
> >>>
> >>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com>
> >>> wrote:
> >>> > I hope this isn't too much of a newbie question.  I am using
> Cassandra
> >>> > 0.6.1
> >>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5
> >>> > data
> >>> > drives.  The nodes are running HDFS to serve files within the
> cluster,
> >>> > but
> >>> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
> >>> > large
> >>> > set of web pages (the ClueWeb collection, but more is coming) and my
> >>> > Cassandra daemons keep dying.
> >>> >
> >>> > I'm loading the pages into a simple column family that lets me fetch
> >>> > out
> >>> > pages by an internal ID or by URL.  The biggest thing in the row is
> the
> >>> > page
> >>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
> >>> > columns.
> >>> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
> >>> > doing
> >>> > batch mutations over Thrift, about 2500 pages per batch, because that
> >>> > was
> >>> > fastest for me in testing.
> >>> >
> >>> > At this point, each Cassandra node has between 500GB and 1.5TB
> >>> > according to
> >>> > nodetool ring.  Let's say I start the daemons up, and they all go
> live
> >>> > after
> >>> > a couple minutes of scanning the tables.  I then start my importer,
> >>> > which is
> >>> > a single Java process reading Clueweb bundles over HDFS, cutting them
> >>> > up,
> >>> > and sending the mutations to Cassandra.  I only talk to one node at a
> >>> > time,
> >>> > switching to a new node when I get an exception.  As the job runs
> over
> >>> > a few
> >>> > hours, the Cassandra daemons eventually fall over, either with no
> error
> >>> > in
> >>> > the log or reporting that they are out of heap.
> >>> >
> >>> > Each daemon is getting 6GB of RAM and has scads of disk space to play
> >>> > with.
> >>> > I've set the storage-conf.xml to take 256MB in a memtable before
> >>> > flushing
> >>> > (like the BMT case), and to do batch commit log flushes, and to not
> >>> > have any
> >>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I
> >>> > would
> >>> > eventually like this Cassandra setup to serve a light request load
> but
> >>> > over
> >>> > say 50-100 TB of data.  I'd appreciate any help or advice you can
> >>> > offer.
> >>> >
> >>> > Thanks,
> >>> > Ian
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of Riptano, the source for professional Cassandra support
> >>> http://riptano.com
> >>
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Scaling problems

Posted by Jonathan Ellis <jb...@gmail.com>.
You should check the JMX stages I posted about.

On Fri, May 21, 2010 at 7:05 AM, Ian Soboroff <is...@gmail.com> wrote:
> Just an update.  I rolled the memtable size back to 128MB.  I am still
> seeing that the daemon runs for a while with reasonable heap usage, but then
> the heap climbs up to the max (6GB in this case, should be plenty) and it
> starts GCing, without much getting cleared.  The client catches lots of
> exceptions, where I wait 30 seconds and try again, with a new client if
> necessary, but it doesn't clear up.
>
> Could this be related to memory leak problems I've skimmed past on the list
> here?
>
> It can't be that I'm creating rows a bit at a time... once I stick a web
> page into two CFs, it's over and done with for this application.  I'm just
> trying to get stuff loaded.
>
> Is there a limit to how much on-disk data a Cassandra daemon can manage?  Is
> there runtime overhead associated with stuff on disk?
>
> Ian
>
> On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <is...@gmail.com> wrote:
>>
>> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I didn't
>> realize that I was trying to float so many memtables.  I'll poke tomorrow
>> and report if it gets fixed.
>> Ian
>>
>> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>>>
>>> Some possibilities:
>>>
>>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
>>> small)
>>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
>>> large pending ops -- large = 100s)
>>> You're creating large rows a bit at a time and Cassandra OOMs when it
>>> tries to compact (the oom should usually be in the compaction thread)
>>> You have your 5 disks each with a separate data directory, which will
>>> allow up to 12 total memtables in-flight internally, and 12*256 is too
>>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
>>> show large pending ops -- large = more than 2 or 3)
>>>
>>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com>
>>> wrote:
>>> > I hope this isn't too much of a newbie question.  I am using Cassandra
>>> > 0.6.1
>>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5
>>> > data
>>> > drives.  The nodes are running HDFS to serve files within the cluster,
>>> > but
>>> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
>>> > large
>>> > set of web pages (the ClueWeb collection, but more is coming) and my
>>> > Cassandra daemons keep dying.
>>> >
>>> > I'm loading the pages into a simple column family that lets me fetch
>>> > out
>>> > pages by an internal ID or by URL.  The biggest thing in the row is the
>>> > page
>>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
>>> > columns.
>>> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
>>> > doing
>>> > batch mutations over Thrift, about 2500 pages per batch, because that
>>> > was
>>> > fastest for me in testing.
>>> >
>>> > At this point, each Cassandra node has between 500GB and 1.5TB
>>> > according to
>>> > nodetool ring.  Let's say I start the daemons up, and they all go live
>>> > after
>>> > a couple minutes of scanning the tables.  I then start my importer,
>>> > which is
>>> > a single Java process reading Clueweb bundles over HDFS, cutting them
>>> > up,
>>> > and sending the mutations to Cassandra.  I only talk to one node at a
>>> > time,
>>> > switching to a new node when I get an exception.  As the job runs over
>>> > a few
>>> > hours, the Cassandra daemons eventually fall over, either with no error
>>> > in
>>> > the log or reporting that they are out of heap.
>>> >
>>> > Each daemon is getting 6GB of RAM and has scads of disk space to play
>>> > with.
>>> > I've set the storage-conf.xml to take 256MB in a memtable before
>>> > flushing
>>> > (like the BMT case), and to do batch commit log flushes, and to not
>>> > have any
>>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I
>>> > would
>>> > eventually like this Cassandra setup to serve a light request load but
>>> > over
>>> > say 50-100 TB of data.  I'd appreciate any help or advice you can
>>> > offer.
>>> >
>>> > Thanks,
>>> > Ian
>>> >
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
Just an update.  I rolled the memtable size back to 128MB.  I am still
seeing that the daemon runs for a while with reasonable heap usage, but then
the heap climbs to the max (6GB in this case, which should be plenty) and it
starts GCing without much getting cleared.  The client catches lots of
exceptions; when that happens I wait 30 seconds and retry, with a new client
if necessary, but the situation doesn't clear up.
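
For context, the client-side retry logic is roughly the sketch below; the node
names are placeholders, and sendBatch() is a stand-in for opening a Thrift
client and calling batch_mutate as in the earlier sketch.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class RetryingSender {
    // Node names are placeholders for the cluster's 14 nodes.
    static final List<String> NODES = Arrays.asList("node1", "node2", "node3");

    // Keep retrying one batch of pages until some node accepts it: wait 30
    // seconds after a failure, then move to the next node with a fresh client.
    static void sendWithRetry(Map<String, byte[]> pages) throws InterruptedException {
        int i = 0;
        while (true) {
            try {
                sendBatch(NODES.get(i % NODES.size()), pages);
                return;
            } catch (Exception e) {
                Thread.sleep(30 * 1000L);   // back off 30 seconds
                i++;                        // rotate to another node next time
            }
        }
    }

    // Stand-in for opening a fresh Thrift client to 'node' and calling
    // batch_mutate with ~2500 pages.
    static void sendBatch(String node, Map<String, byte[]> pages) throws Exception {
    }
}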

Could this be related to memory leak problems I've skimmed past on the list
here?

It can't be that I'm creating rows a bit at a time... once I stick a web
page into two CFs, it's over and done with for this application.  I'm just
trying to get stuff loaded.

Is there a limit to how much on-disk data a Cassandra daemon can manage?  Is
there runtime overhead associated with stuff on disk?

Ian

On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <is...@gmail.com> wrote:

> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I didn't
> realize that I was trying to float so many memtables.  I'll poke tomorrow
> and report if it gets fixed.
> Ian
>
>
> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com>wrote:
>
>> Some possibilities:
>>
>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
>> small)
>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
>> large pending ops -- large = 100s)
>> You're creating large rows a bit at a time and Cassandra OOMs when it
>> tries to compact (the oom should usually be in the compaction thread)
>> You have your 5 disks each with a separate data directory, which will
>> allow up to 12 total memtables in-flight internally, and 12*256 is too
>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
>> show large pending ops -- large = more than 2 or 3)
>>
>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com>
>> wrote:
>> > I hope this isn't too much of a newbie question.  I am using Cassandra
>> 0.6.1
>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5
>> data
>> > drives.  The nodes are running HDFS to serve files within the cluster,
>> but
>> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
>> large
>> > set of web pages (the ClueWeb collection, but more is coming) and my
>> > Cassandra daemons keep dying.
>> >
>> > I'm loading the pages into a simple column family that lets me fetch out
>> > pages by an internal ID or by URL.  The biggest thing in the row is the
>> page
>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
>> columns.
>> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
>> doing
>> > batch mutations over Thrift, about 2500 pages per batch, because that
>> was
>> > fastest for me in testing.
>> >
>> > At this point, each Cassandra node has between 500GB and 1.5TB according
>> to
>> > nodetool ring.  Let's say I start the daemons up, and they all go live
>> after
>> > a couple minutes of scanning the tables.  I then start my importer,
>> which is
>> > a single Java process reading Clueweb bundles over HDFS, cutting them
>> up,
>> > and sending the mutations to Cassandra.  I only talk to one node at a
>> time,
>> > switching to a new node when I get an exception.  As the job runs over a
>> few
>> > hours, the Cassandra daemons eventually fall over, either with no error
>> in
>> > the log or reporting that they are out of heap.
>> >
>> > Each daemon is getting 6GB of RAM and has scads of disk space to play
>> with.
>> > I've set the storage-conf.xml to take 256MB in a memtable before
>> flushing
>> > (like the BMT case), and to do batch commit log flushes, and to not have
>> any
>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I would
>> > eventually like this Cassandra setup to serve a light request load but
>> over
>> > say 50-100 TB of data.  I'd appreciate any help or advice you can offer.
>> >
>> > Thanks,
>> > Ian
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>

Re: Scaling problems

Posted by Ian Soboroff <is...@gmail.com>.
Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I didn't
realize that I was trying to float so many memtables.  I'll poke tomorrow
and report if it gets fixed.
Ian

On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> Some possibilities:
>
> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
> small)
> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
> large pending ops -- large = 100s)
> You're creating large rows a bit at a time and Cassandra OOMs when it
> tries to compact (the oom should usually be in the compaction thread)
> You have your 5 disks each with a separate data directory, which will
> allow up to 12 total memtables in-flight internally, and 12*256 is too
> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
> show large pending ops -- large = more than 2 or 3)
>
> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com> wrote:
> > I hope this isn't too much of a newbie question.  I am using Cassandra
> 0.6.1
> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5
> data
> > drives.  The nodes are running HDFS to serve files within the cluster,
> but
> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
> large
> > set of web pages (the ClueWeb collection, but more is coming) and my
> > Cassandra daemons keep dying.
> >
> > I'm loading the pages into a simple column family that lets me fetch out
> > pages by an internal ID or by URL.  The biggest thing in the row is the
> page
> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
> columns.
> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
> doing
> > batch mutations over Thrift, about 2500 pages per batch, because that was
> > fastest for me in testing.
> >
> > At this point, each Cassandra node has between 500GB and 1.5TB according
> to
> > nodetool ring.  Let's say I start the daemons up, and they all go live
> after
> > a couple minutes of scanning the tables.  I then start my importer, which
> is
> > a single Java process reading Clueweb bundles over HDFS, cutting them up,
> > and sending the mutations to Cassandra.  I only talk to one node at a
> time,
> > switching to a new node when I get an exception.  As the job runs over a
> few
> > hours, the Cassandra daemons eventually fall over, either with no error
> in
> > the log or reporting that they are out of heap.
> >
> > Each daemon is getting 6GB of RAM and has scads of disk space to play
> with.
> > I've set the storage-conf.xml to take 256MB in a memtable before flushing
> > (like the BMT case), and to do batch commit log flushes, and to not have
> any
> > caching in the CFs.  I'm sure I must be tuning something wrong.  I would
> > eventually like this Cassandra setup to serve a light request load but
> over
> > say 50-100 TB of data.  I'd appreciate any help or advice you can offer.
> >
> > Thanks,
> > Ian
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: Scaling problems

Posted by Jonathan Ellis <jb...@gmail.com>.
Some possibilities:

- You didn't adjust the Cassandra heap size in cassandra.in.sh (1GB is too small)
- You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show large pending ops -- large = 100s)
- You're creating large rows a bit at a time and Cassandra OOMs when it tries to compact (the OOM should usually be in the compaction thread)
- You have your 5 disks each with a separate data directory, which will allow up to 12 total memtables in flight internally, and 12 x 256MB is too much for the heap size you have (FLUSH-WRITER-POOL in tpstats will show large pending ops -- large = more than 2 or 3)
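
To put rough numbers on that last one: 12 memtables x 256MB is about 3GB of
serialized data before anything has flushed, and the in-heap representation of
a memtable is typically several times its serialized size, so that alone can
exhaust a 6GB heap (and certainly the 1GB default).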

On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <is...@gmail.com> wrote:
> I hope this isn't too much of a newbie question.  I am using Cassandra 0.6.1
> on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5 data
> drives.  The nodes are running HDFS to serve files within the cluster, but
> at the moment the rest of Hadoop is shut down.  I'm trying to load a large
> set of web pages (the ClueWeb collection, but more is coming) and my
> Cassandra daemons keep dying.
>
> I'm loading the pages into a simple column family that lets me fetch out
> pages by an internal ID or by URL.  The biggest thing in the row is the page
> content, maybe 15-20k per page of raw HTML.  There aren't a lot of columns.
> I tried Thrift, Hector, and the BMT interface, and at the moment I'm doing
> batch mutations over Thrift, about 2500 pages per batch, because that was
> fastest for me in testing.
>
> At this point, each Cassandra node has between 500GB and 1.5TB according to
> nodetool ring.  Let's say I start the daemons up, and they all go live after
> a couple minutes of scanning the tables.  I then start my importer, which is
> a single Java process reading Clueweb bundles over HDFS, cutting them up,
> and sending the mutations to Cassandra.  I only talk to one node at a time,
> switching to a new node when I get an exception.  As the job runs over a few
> hours, the Cassandra daemons eventually fall over, either with no error in
> the log or reporting that they are out of heap.
>
> Each daemon is getting 6GB of RAM and has scads of disk space to play with.
> I've set the storage-conf.xml to take 256MB in a memtable before flushing
> (like the BMT case), and to do batch commit log flushes, and to not have any
> caching in the CFs.  I'm sure I must be tuning something wrong.  I would
> eventually like this Cassandra setup to serve a light request load but over
> say 50-100 TB of data.  I'd appreciate any help or advice you can offer.
>
> Thanks,
> Ian
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com