Posted to user@cassandra.apache.org by Filippo Diotalevi <fi...@ntoklo.com> on 2012/05/01 17:58:14 UTC

How Cassandra determines the splits

Hi,
I'm having problems in my Cassandra/Hadoop cluster (Cassandra 1.0.8 + CDH3u3) related to how Cassandra splits the data to be processed by Hadoop.

I'm currently testing a MapReduce job, starting from a CF of roughly 1500 rows, with

cassandra.input.split.size 10
cassandra.range.batch.size 1

but what I consistently see is that, while most of the tasks have 1-20 rows assigned each, one of them is assigned 400+ rows, which gives me all sorts of problems in terms of timeouts and memory consumption (not to mention seeing the mapper progress bar going past 4000%).
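
(For reference, these two properties are set through ConfigHelper when building the job. Here is a minimal sketch of that setup; the keyspace, column family, and job names below are placeholders, not my real ones.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;

    public class SplitTestJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // cassandra.input.split.size: target number of rows per input split
            ConfigHelper.setInputSplitSize(conf, 10);
            // cassandra.range.batch.size: rows fetched per range-scan request
            ConfigHelper.setRangeBatchSize(conf, 1);
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyCF");
            Job job = new Job(conf, "split-test");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            // ... mapper, reducer and output configuration omitted ...
        }
    }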

Do you have any suggestions to solve/troubleshoot this issue?

-- 
Filippo Diotalevi

Re: How Cassandra determines the splits

Posted by Patrik Modesto <pa...@gmail.com>.
Hi,

I had a similar problem with Cassandra 0.8.x: it occurred when
Cassandra was configured with rpc_address: 0.0.0.0 and the Hadoop job
was started from outside the Cassandra cluster. With version 1.0.x the
problem is gone.

You can debug the splits with Thrift. This is a copy-and-paste excerpt
from my split-testing Python utility; I've filled in the connection
boilerplate here so it runs standalone (adjust the host, port, and the
'PageData' column family name to your setup):

        print "describe_ring"
        res = client.describe_ring(argv[1])
        for t in res:
            print "%s - %s [%s] [%s]" % (t.start_token, t.end_token,
",".join(t.endpoints), ",".join(t.rpc_endpoints),)

        for r in res:
            res2 = client.describe_splits('PageData',
                    r.start_token, r.end_token,
                    24*1024)

It asks Cassandra for a list of nodes with their token ranges, then asks
each node for splits. You should adjust the 24*1024 split size to match your data.
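
For what it's worth, this is roughly what ColumnFamilyInputFormat does on
the Hadoop side (a simplified sketch from memory, not the exact 1.0.x
source); it walks the same describe_ring/describe_splits path, passing
cassandra.input.split.size as the keys-per-split argument:

    import java.util.List;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.TokenRange;

    public class SplitSketch {
        // Assumes client.set_keyspace(keyspace) has already been called.
        static void printSplits(Cassandra.Client client, String keyspace,
                                String cfName, int splitSize) throws Exception {
            List<TokenRange> ranges = client.describe_ring(keyspace);
            for (TokenRange range : ranges) {
                // Each adjacent pair of returned tokens becomes one InputSplit,
                // so every mapper should see about splitSize rows.
                List<String> tokens = client.describe_splits(
                        cfName, range.start_token, range.end_token, splitSize);
                System.out.println(range.start_token + " -> " + range.end_token
                        + ": " + (tokens.size() - 1) + " splits");
            }
        }
    }

If one token range keeps coming back with far fewer split points than its
row count would suggest, that range is where your oversized split is
coming from.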

Regards,
Patrik

