Posted to user@cassandra.apache.org by Filippo Diotalevi <fi...@ntoklo.com> on 2012/05/01 17:58:14 UTC
How Cassandra determines the splits
Hi,
I'm having problems in my Cassandra/Hadoop (1.0.8 + cdh3u3) cluster related to how Cassandra splits the data to be processed by Hadoop.
I'm currently testing a map reduce job, starting from a CF of roughly 1500 rows, with
cassandra.input.split.size 10
cassandra.range.batch.size 1
but what I consistently see is that, while most of the tasks have 1-20 rows assigned each, one of them is assigned 400+ rows, which gives me all sorts of problems with timeouts and memory consumption (not to mention the mapper progress bar climbing to 4000% and beyond).
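[Editor's note: the skew described above can be sketched in a few lines. The helper below is hypothetical (names and structure are illustrative, not Cassandra's actual code); it only mimics the idea that each token range is cut into roughly ceil(estimated_rows / split_size) pieces based on an *estimate*, so a range whose estimate is badly off keeps all of its actual rows in one split, regardless of cassandra.input.split.size.]

```python
import math

def plan_splits(estimated_rows_per_range, split_size):
    """Rough sketch of split planning: each token range is cut into
    ceil(estimated / split_size) sub-splits. Hypothetical helper,
    not the actual Cassandra implementation."""
    plan = {}
    for token_range, estimated in estimated_rows_per_range.items():
        plan[token_range] = max(1, math.ceil(estimated / split_size))
    return plan

# If a range's estimate is 10 but it actually holds 400+ rows, it is
# still cut into a single split -- the split size caps only estimated
# rows, not actual ones.
print(plan_splits({"rangeA": 20, "rangeB": 10, "rangeC": 10}, 10))
# -> {'rangeA': 2, 'rangeB': 1, 'rangeC': 1}
```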
Do you have any suggestions for solving/troubleshooting this issue?
--
Filippo Diotalevi
Re: How Cassandra determines the splits
Posted by Patrik Modesto <pa...@gmail.com>.
Hi,
I had a similar problem with Cassandra 0.8.x; it occurred when
Cassandra was configured with rpc_address: 0.0.0.0 and the Hadoop job
was started from outside the Cassandra cluster. With version 1.0.x the
problem is gone.
You can debug the splits with Thrift. This is a copy-and-paste excerpt
from my split-testing Python utility:
print "describe_ring"
res = client.describe_ring(argv[1])
for t in res:
    print "%s - %s [%s] [%s]" % (t.start_token, t.end_token,
                                 ",".join(t.endpoints),
                                 ",".join(t.rpc_endpoints))
for r in res:
    res2 = client.describe_splits('PageData',
                                  r.start_token, r.end_token,
                                  24 * 1024)
It asks Cassandra for the list of nodes with their token ranges, then
asks each node for its splits. You should adjust the 24*1024 split
size to suit your data.
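[Editor's note: to interpret the describe_splits reply, recall that Thrift returns a list of boundary tokens, so N tokens delimit N-1 sub-ranges. The helper below is a hypothetical sketch (the function name and sample tokens are illustrative) for spotting a range that yields suspiciously few sub-splits compared with its neighbours -- the likely home of the oversized split.]

```python
def count_subsplits(split_tokens):
    """N boundary tokens returned by describe_splits delimit N - 1
    sub-ranges. Hypothetical helper; names are illustrative."""
    return max(0, len(split_tokens) - 1)

# A range that Cassandra estimates to hold few keys yields few
# sub-splits even if it actually holds most of your rows.
print(count_subsplits(["0", "100", "200", "300"]))  # -> 3
```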
Regards,
Patrik