Posted to user@cassandra.apache.org by Saladi Naidu <na...@yahoo.com> on 2015/08/27 15:43:58 UTC

Data Distribution in Table/Column Family

Is there a way to find out how the data in a column family is distributed across nodes? Nodetool reports how data is distributed across nodes, but only as the total for each node. We are seeing heavy load on one node, and I suspect that partitioning is not distributing the data equally. To prove that to the development team, we need the stats for that table.

Naidu Saladi

Re: Data Distribution in Table/Column Family

Posted by Jack Krupansky <ja...@gmail.com>.
Even if the data were absolutely evenly distributed, that wouldn't guarantee
that the hash values of the partition keys used in your client queries
won't collide and result in a hotspot.

Another possibility is that your data is not partitioned well at the
primary key level. Are you using clustering keys? Only the partition key
portion of the primary key is used to produce the hash/token value that
selects the node. Sometimes you need to use composite partition keys to
ensure that primary keys will be better distributed for particular access
patterns.
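
A rough sketch of the point above, not Cassandra's actual placement logic:
MD5 modulo a node count stands in for the real partitioner and token
ranges, and the cluster size, the "day" and "sensor_id" columns, and the
workload are all made up for illustration. It only shows that rows sharing
a partition key land together regardless of any clustering column, while a
composite partition key spreads them out.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    """Map a partition key (a tuple) to a node index.

    Assumption: MD5 modulo NODES is only a stand-in for Cassandra's
    partitioner and token ranges.
    """
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES


def rows_per_node(partition_keys):
    """Count how many rows each node receives for the given partition keys."""
    return Counter(node_for(pk) for pk in partition_keys)


# Hypothetical workload: 10,000 readings, all written on the same day.
readings = [("2015-08-27", sensor_id) for sensor_id in range(10_000)]

# Partition key = (day); sensor_id would only be a clustering column,
# so it plays no part in choosing the node.
by_day = rows_per_node((day,) for day, _ in readings)

# Composite partition key = (day, sensor_id).
by_day_and_sensor = rows_per_node(readings)

print("partition key (day):           ", dict(by_day))
print("partition key (day, sensor_id):", dict(by_day_and_sensor))
```

With the single-column key every row maps to one node; with the composite
key the rows are spread over all of them. The trade-off is that reading a
whole day then touches many partitions, so the right key depends on the
access pattern.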

-- Jack Krupansky

Re: Data Distribution in Table/Column Family

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi,

Did you try running the following on all your nodes and comparing the results?

du -sh /*whatever*/cassandra/data/*

Of course, if your snapshot sizes differ between nodes, exclude them from
the above command (or delete the snapshots directly).

This should give a rough answer to your question about whether the data is
evenly distributed. (/!\ A deviation of a few MB or GB, depending on your
total data size, can happen without being a real issue; I would say up to
5-15% on a big enough dataset.)
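
A small sketch of that comparison, assuming you have already collected one
size per node (for example from the du command above). The node names and
sizes below are made up, and the 15% threshold is just the upper end of
the rule of thumb, not a Cassandra-defined limit.

```python
from statistics import mean


def deviations(sizes_by_node):
    """Return each node's relative deviation from the mean size."""
    avg = mean(sizes_by_node.values())
    return {node: (size - avg) / avg for node, size in sizes_by_node.items()}


def flag_outliers(sizes_by_node, threshold=0.15):
    """List nodes whose size deviates from the mean by more than threshold."""
    devs = deviations(sizes_by_node)
    return sorted(node for node, dev in devs.items() if abs(dev) > threshold)


# Hypothetical per-node data sizes in GB (not real measurements).
sizes_gb = {"node1": 102.0, "node2": 98.0, "node3": 151.0, "node4": 99.0}

for node, dev in sorted(deviations(sizes_gb).items()):
    print(f"{node}: {dev:+.1%}")
print("outliers:", flag_outliers(sizes_gb))
```

Note that one very large node pulls the mean up, so on a small cluster it
may be worth comparing against the median instead.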

Also, "nodetool cfstats" (run on each node) gives you an approximation of
the number of rows and the space used, among other useful information.

But the main thing to do is to double-check your data model to see whether
your workload could create a hotspot on any of your tables. You should be
able to guess whether one of your tables is badly distributed, imho.

C*heers,

Alain
